Plotting in Python extracting only specific columns from a CSV - python

EDIT: shortened the question, as suggested.
I am quite new to Python and programming, and I would like to plot the 1st and 4th columns on a log(x)-log(y) graph. Honestly, I don't know how to extract only the two columns I need from this:
16:58:58 | 2.090 | 26.88 | 1.2945E-9 | 45.8
16:59:00 | 2.031 | 27.00 | 1.3526E-9 | 132.1
16:59:02 | 2.039 | 26.90 | 1.3843E-9 | 178.5
16:59:04 | 2.031 | 26.98 | 1.4628E-9 | 228.9
16:59:06 | 2.031 | 27.04 | 1.5263E-9 | 259.8
16:59:08 | 2.027 | 26.84 | 1.6010E-9 | 271.8

Using pandas:
import pandas as pd
df = pd.read_csv("data.txt", sep=r"\s*\|\s*", engine="python", header=None, index_col=0)
df.plot(y=3)  # column label 3 is the 4th column of the file
(Note that this ignores the logarithmic scaling because it's not clear what the logarithm of a time should be)
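If the goal is specifically the 1st and 4th columns, one possible sketch is below. It assumes the same pipe-separated layout as the sample, and it turns the clock times into seconds elapsed since the first row so that the x axis is numeric. The column names `time`, `value` and `elapsed_s` and the helpers `load_two_columns` and `plot_two_columns` are made up for this illustration.

```python
import io
import pandas as pd

SAMPLE = """16:58:58 | 2.090 | 26.88 | 1.2945E-9 | 45.8
16:59:00 | 2.031 | 27.00 | 1.3526E-9 | 132.1
16:59:02 | 2.039 | 26.90 | 1.3843E-9 | 178.5
"""

def load_two_columns(source):
    """Read the pipe-separated data and keep only the 1st and 4th columns."""
    raw = pd.read_csv(source, sep=r"\s*\|\s*", engine="python", header=None)
    df = raw[[0, 3]].copy()            # labels 0 and 3 are the 1st and 4th columns
    df.columns = ["time", "value"]     # hypothetical names, not from the file
    t = pd.to_timedelta(df["time"])    # "HH:MM:SS" strings become timedeltas
    df["elapsed_s"] = (t - t.iloc[0]).dt.total_seconds()
    return df

def plot_two_columns(df, logx=False):
    """Plot value against elapsed seconds with a logarithmic y axis."""
    import matplotlib.pyplot as plt   # imported here so loading works without matplotlib
    fig, ax = plt.subplots()
    # A log x axis cannot show elapsed_s == 0, so that row is dropped when logx is set.
    data = df[df["elapsed_s"] > 0] if logx else df
    ax.plot(data["elapsed_s"], data["value"], marker="o")
    ax.set_yscale("log")
    if logx:
        ax.set_xscale("log")
    ax.set_xlabel("seconds since first sample")
    ax.set_ylabel("4th column")
    return fig

df = load_two_columns(io.StringIO(SAMPLE))
print(df)
```

For a real file, pass its path instead of the `io.StringIO` object, then call `plot_two_columns(df, logx=True)` followed by `plt.show()` to get the log-log view.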

If you would rather not use the excellent pandas, here is a manual approach.
import matplotlib.pyplot as plt
import math
import datetime as dt
test = """16:58:58 | 2.090 | 26.88 | 1.2945E-9 | 45.8\n
16:59:00 | 2.031 | 27.00 | 1.3526E-9 | 132.1\n
16:59:02 | 2.039 | 26.90 | 1.3843E-9 | 178.5\n
16:59:04 | 2.031 | 26.98 | 1.4628E-9 | 228.9\n
16:59:06 | 2.031 | 27.04 | 1.5263E-9 | 259.8\n
16:59:08 | 2.027 | 26.84 | 1.6010E-9 | 271.8\n"""
lines = [line for line in test.splitlines() if line != ""]
# Here is the real code
subset = []
for line in lines:
    parts = line.split('|')
    ts = dt.datetime.strptime(parts[0].strip(), "%H:%M:%S")
    num = math.log(float(parts[3].strip()))  # natural log of the 4th column
    subset.append((ts, num))
# now there is a list of tuples with your datapoints, looking like
# [(datetime.datetime(1900, 1, 1, 16, 58, 58), <log of 1.2945E-9>), (datetime.datetime(1900, 1, 1, 16, 59), ...]
# I made this list intentionally so that you can see how one can gather everything in a tidy way from the
# raw string data.
# Now lets separate things for plotting
times = [elem[0] for elem in subset]
values = [elem[1] for elem in subset]
# now to plot, I'm going to use the matplotlib plot_date function.
plt.figure()
plt.plot_date(times, values)
# do some formatting on the date axis
plt.gcf().autofmt_xdate()
plt.show()

Related

How to boxplot different columns from a dataframe (y axis) vs groupby a range of hours (x axis) using plotly

Good morning,
I'm trying to boxplot the 'columns' from 1 to 6 vs the 'ElapsedTime(hours)' column with the use of plotly library.
Here is my dataframe :
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| Date | Time | Column1 | Column2 | Column3 | Column4 | Column5 | Column6 | ElapsedTime(hours) |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 29/07/2021 | 06:48:37 | 0,011535 | 8,4021 | 0,00027 | 0,027806 | 8,431 | 0,000362 | 0 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 29/07/2021 | 06:59:37 | 0,013458 | 8,4421 | 0,000314 | 0,032214 | 8,4738 | 0,000416 | 0,183333333 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 29/07/2021 | 07:14:37 | 0,017793 | 8,4993 | 0,000384 | 0,038288 | 8,5372 | 0,000486 | 0,433333333 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 30/07/2021 | 08:12:50 | 0,018808 | 8,545 | 0,000414 | 0,042341 | 8,5891 | 0,000539 | 24,9702778 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 30/07/2021 | 08:42:50 | 0,025931 | 8,3627 | 0,000534 | 0,032379 | 8,3556 | 0,000557 | 25,9036111 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 30/07/2021 | 08:57:50 | 0,025164 | 8,5518 | 0,000505 | 0,041134 | 8,6516 | 0,000254 | 26,1536111 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 31/07/2021 | 05:45:28 | 0,026561 | 8,6266 | 0,000533 | 0,050387 | 8,6718 | 0,00065 | 46,9475 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 31/07/2021 | 05:55:28 | 0,027744 | 8,6455 | 0,000543 | 0,051511 | 8,6916 | 0,000664 | 47,11416667 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
| 31/07/2021 | 06:05:28 | 0,028854 | 8,485 | 0,000342 | 0,05693 | 8,6934 | 0,000695 | 47,28083333 |
+------------+----------+----------+---------+----------+----------+---------+----------+--------------------+
For now, I only know how to box-plot each column on its own, using this code:
import warnings
import pandas as pd
import plotly.graph_objects as go
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category= UserWarning)
da = pd.DataFrame()
da['Date'] = ["29/07/2021", "29/07/2021", "29/07/2021", "30/07/2021", "30/07/2021", "30/07/2021", "31/07/2021", "31/07/2021", "31/07/2021"]
da['Time'] = ["06:48:37", "06:59:37", "07:14:37", "08:12:50", "08:42:50", "08:57:50", "05:45:28", "05:55:28", "06:05:28"]
da["Column1"] = [0.011534891, 0.013458399, 0.017792937, 0.018807581, 0.025931434, 0.025163517, 0.026561283, 0.027743659, 0.028854]
da["Column2"] = [8.4021, 8.4421, 8.4993, 8.545, 8.3627, 8.5518, 8.6266, 8.6455, 8.485]
da["Column3"] = [0.000270475, 0.000313769, 0.000383506, 0.000414331, 0.000533619, 0.000505081, 0.000533131, 0.000543031, 0.000342]
da["Column4"] = [0.027806399, 0.032213984, 0.038287754, 0.042340721, 0.032378571, 0.041134106, 0.050387029, 0.051511238, 0.05693]
da["Column5"] = [8.431, 8.4738, 8.5372, 8.5891, 8.3556, 8.6516, 8.6718, 8.6916, 8.6934]
da["Column6"] = [0.000362081, 0.000416463, 0.000486275, 0.000539244, 0.000556613, 0.000253831, 0.00064975, 0.000664063, 0.000695]
da["ElapsedTime(hours)"] = [0, 0.183333333, 0.433333333, 24.9702778, 25.9036111, 26.1536111, 46.9475, 47.11416667, 47.28083333]
fig = go.Figure()
fig.add_trace(
    go.Box(y=da['Column1'], name='Column1'))
fig.add_trace(
    go.Box(y=da['Column2'], name='Column2'))
fig.add_trace(
    go.Box(y=da['Column3'], name='Column3'))
fig.add_trace(
    go.Box(y=da['Column4'], name='Column4'))
fig.add_trace(
    go.Box(y=da['Column5'], name='Column5'))
fig.add_trace(
    go.Box(y=da['Column6'], name='Column6'))
fig.update_layout(legend=dict(
    yanchor="top",
    y=1.24,
    xanchor="left",
    x=0.15
))
from plotly import offline
offline.plot(fig)
Output: [figure: box plots of Column1 to Column6]
I can choose to show one column: [figure: box plot of a single column]
What I want (if possible): plot columns 1 to 6 against ranges of ElapsedTime(hours). For example, if I choose a range of 10 hours, all the values that fall within each 10-hour range should go into one box.
PS: if I add x=da['ElapsedTime(hours)'] inside go.Box(), each value of columns 1 to 6 is plotted against a single value of the ElapsedTime column, and I don't want that. I want one box per range of ElapsedTime.
Extra: if possible, I would like columns 1 to 6 to be in a dropdown button, so that I can click and choose which column to show for the ElapsedTime range I chose.
Thank you for your time, and have a great day!
EDIT:
I tried the lines below. The problem is that I get an error saying the DataFrame has no name attribute (name=data.name). If I drop name=data.name, I get a plot that is not a box plot. Do you have any idea how to overcome this problem?
da["DateTime"] = pd.to_datetime(da.Date + " " + da.Time)
columns = [c for c in da.columns if c.startswith("Column")]
da.set_index("DateTime")[columns].resample("1D")
fig = go.Figure()
for start_datetime, data in da.set_index("DateTime")[columns].resample("1D"):
    fig.add_trace(
        go.Box(x=data.index, y=data.values, name=data.name))
fig.update_layout(legend=dict(
    yanchor="top",
    y=1.24,
    xanchor="left",
    x=0.15
))
fig.update_layout(boxmode='group')
from plotly import offline
offline.plot(fig)
Here are some suggestions.
Merge the Date and Time columns into a DateTime column:
import pandas as pd
da = pd.DataFrame()
da['Date'] = ["29/07/2021", "29/07/2021", "29/07/2021", "30/07/2021", "30/07/2021", "30/07/2021", "31/07/2021", "31/07/2021", "31/07/2021"]
da['Time'] = ["06:48:37", "06:59:37", "07:14:37", "08:12:50", "08:42:50", "08:57:50", "05:45:28", "05:55:28", "06:05:28"]
da["Column1"] = [0.011534891, 0.013458399, 0.017792937, 0.018807581, 0.025931434, 0.025163517, 0.026561283, 0.027743659, 0.028854]
da["Column2"] = [8.4021, 8.4421, 8.4993, 8.545, 8.3627, 8.5518, 8.6266, 8.6455, 8.485]
da["Column3"] = [0.000270475, 0.000313769, 0.000383506, 0.000414331, 0.000533619, 0.000505081, 0.000533131, 0.000543031, 0.000342]
da["Column4"] = [0.027806399, 0.032213984, 0.038287754, 0.042340721, 0.032378571, 0.041134106, 0.050387029, 0.051511238, 0.05693]
da["Column5"] = [8.431, 8.4738, 8.5372, 8.5891, 8.3556, 8.6516, 8.6718, 8.6916, 8.6934]
da["Column6"] = [0.000362081, 0.000416463, 0.000486275, 0.000539244, 0.000556613, 0.000253831, 0.00064975, 0.000664063, 0.000695]
da["ElapsedTime(hours)"] = [0, 0.183333333, 0.433333333, 24.9702778, 25.9036111, 26.1536111, 46.9475, 47.11416667, 47.28083333]
da["DateTime"] = pd.to_datetime(da.Date + " " + da.Time, dayfirst=True)  # dates are day/month/year
df = da # df is more natural for me ;)
I use the following to mark the interesting columns:
columns = [c for c in df.columns if c.startswith("Column")]
Use the resample method to aggregate data over some time range.
For example, to aggregate over one day, use
df.set_index("DateTime")[columns].resample("1D")
The result is a resampler object. You can either run an aggregation on it, e.g. compute the mean of each sample:
df.set_index("DateTime")[columns].resample("1D").mean()
or iterate over it. If you want to leverage plotly's functionality to create the boxplot, I would use a loop:
for start_datetime, data in df.set_index("DateTime")[columns].resample("1D"):
    print(start_datetime)
    print(data)
    print()
Instead of the print functions, use the plotly commands to create a box in the boxplot.
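As a rough sketch of the "one box per range of ElapsedTime" idea from the question, the rows can first be labelled with a fixed-width time bin, and that label can then be used as the x value of each box trace. The helper names `add_time_bins` and `box_figure` and the `TimeRange` column are assumptions made for this example, and only two of the six columns are reproduced to keep it short.

```python
import pandas as pd

def add_time_bins(df, width_hours=10, col="ElapsedTime(hours)"):
    """Return a copy of df with a 'TimeRange' label such as '20-30 h' per row."""
    out = df.copy()
    start = (out[col] // width_hours) * width_hours   # lower edge of each row's bin
    out["TimeRange"] = [f"{int(s)}-{int(s + width_hours)} h" for s in start]
    return out

def box_figure(binned, columns):
    """Build one Box trace per column; plotly groups the values by the x label."""
    import plotly.graph_objects as go   # imported here so the binning works without plotly
    fig = go.Figure()
    for c in columns:
        fig.add_trace(go.Box(x=binned["TimeRange"], y=binned[c], name=c))
    fig.update_layout(boxmode="group")  # boxes of different columns sit side by side
    return fig

da = pd.DataFrame({
    "Column1": [0.011534891, 0.013458399, 0.017792937, 0.018807581, 0.025931434,
                0.025163517, 0.026561283, 0.027743659, 0.028854],
    "Column2": [8.4021, 8.4421, 8.4993, 8.545, 8.3627, 8.5518, 8.6266, 8.6455, 8.485],
    "ElapsedTime(hours)": [0, 0.183333333, 0.433333333, 24.9702778, 25.9036111,
                           26.1536111, 46.9475, 47.11416667, 47.28083333],
})
binned = add_time_bins(da, width_hours=10)
print(binned["TimeRange"].tolist())
```

Calling `box_figure(binned, ["Column1", "Column2"]).show()` should then draw one box per column and time range. Clicking a legend entry hides or shows that column, which may be enough in place of a dropdown.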

Splitting a csv into multiple csv's depending on what is in column 1 using python

I currently have a large CSV containing data for a number of events.
Column one contains a number of dates as well as an ID for each event.
Basically, I want to write something in Python so that whenever there is an ID (AL.....), it creates a new CSV named after that ID, containing all the data up to the next ID, so that I end up with one CSV per event.
For info, the whole CSV contains 8 columns, but the division into individual CSVs is based only on column one.
I noticed that the question "Use Python to split a CSV file with multiple headers" is quite similar, but in my case I have AL followed by a different string of numbers each time, and I also want to name the new CSVs after the IDs.
You can achieve this using pandas, so let's first generate some data:
import pandas as pd
import numpy as np
def date_string():
    return str(np.random.randint(1, 32)) + "/" + str(np.random.randint(1, 13)) + "/1997"
l = [date_string() for x in range(20)]
l[0] = "AL123"
l[10] = "AL321"
df = pd.DataFrame(l, columns=['idx'])
# -->
| | idx |
|---:|:-----------|
| 0 | AL123 |
| 1 | 24/3/1997 |
| 2 | 8/6/1997 |
| 3 | 6/9/1997 |
| 4 | 31/12/1997 |
| 5 | 11/6/1997 |
| 6 | 2/3/1997 |
| 7 | 31/8/1997 |
| 8 | 21/5/1997 |
| 9 | 30/1/1997 |
| 10 | AL321 |
| 11 | 8/4/1997 |
| 12 | 21/7/1997 |
| 13 | 9/10/1997 |
| 14 | 31/12/1997 |
| 15 | 15/2/1997 |
| 16 | 21/2/1997 |
| 17 | 3/3/1997 |
| 18 | 16/12/1997 |
| 19 | 16/2/1997 |
So, the interesting positions are 0 and 10, since that is where the AL* strings are.
Now, to split at the AL* rows, you can use:
idx = df.index[df['idx'].str.startswith('AL')]  # index positions of the rows that hold an AL* id
dfs = np.split(df, idx)  # split the data at those positions
for out in dfs[1:]:  # dfs[0] is the (empty) part before the first id
    name = out.iloc[0, 0]
    out.to_csv(name + ".csv", index=False, header=False)  # save the data
This gives you two csv files named AL123.csv and AL321.csv with the first line being the AL* string.
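The same split can also be sketched with a cumulative count of the id rows and a groupby, which avoids passing a DataFrame to np.split. This is only an illustration under the assumption that every id starts with "AL"; the function name `split_by_id`, the `value` column and the temporary output directory are made up for the example.

```python
import os
import tempfile
import pandas as pd

def split_by_id(df, id_col, out_dir, prefix="AL"):
    """Write one CSV per block of rows that starts with an id row; return the paths."""
    is_id = df[id_col].astype(str).str.startswith(prefix)
    block = is_id.cumsum()      # 0 before the first id, then 1, 2, ... per event
    keep = block > 0            # rows before the first id belong to no event
    paths = []
    for _, chunk in df[keep].groupby(block[keep]):
        name = str(chunk[id_col].iloc[0])   # the id row is the first row of its block
        path = os.path.join(out_dir, name + ".csv")
        chunk.to_csv(path, index=False, header=False)
        paths.append(path)
    return paths

# Small made-up frame: two events, each starting with an AL* row.
demo = pd.DataFrame({
    "idx": ["AL123", "24/3/1997", "8/6/1997", "AL321", "8/4/1997"],
    "value": [None, 1.5, 2.5, None, 3.5],
})
with tempfile.TemporaryDirectory() as tmp:
    written = split_by_id(demo, "idx", tmp)
    names = sorted(os.path.basename(p) for p in written)
    with open(os.path.join(tmp, "AL123.csv")) as fh:
        first_file_lines = fh.read().splitlines()
print(names)
```

Because the grouping key is computed from column one only, the remaining columns are carried along unchanged, which matches the 8-column case described in the question.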

pandas plot columns from two dataframes in in one figure

I have a dataframe consisting of mean and std-dev of distributions
df.head()
+---+---------+----------------+-------------+---------------+------------+
| | user_id | session_id | sample_mean | sample_median | sample_std |
+---+---------+----------------+-------------+---------------+------------+
| 0 | 1 | 20081023025304 | 4.972789 | 5 | 0.308456 |
| 1 | 1 | 20081023025305 | 5.000000 | 5 | 1.468418 |
| 2 | 1 | 20081023025306 | 5.274419 | 5 | 4.518189 |
| 3 | 1 | 20081024020959 | 4.634855 | 5 | 1.387244 |
| 4 | 1 | 20081026134407 | 5.088195 | 5 | 2.452059 |
+---+---------+----------------+-------------+---------------+------------+
From this, I plot a histogram of the distribution
plt.hist(df['sample_mean'],bins=50)
plt.xlabel('sampling rate (sec)')
plt.ylabel('Frequency')
plt.title('Histogram of trips mean sampling rate')
plt.show()
I then write a function to compute pdf and cdf, passing dataframe and column name:
def compute_distrib(df, col):
    stats_df = df.groupby(col)[col].agg('count').pipe(pd.DataFrame).rename(columns = {col: 'frequency'})
    # PDF
    stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])
    # CDF
    stats_df['cdf'] = stats_df['pdf'].cumsum()
    stats_df = stats_df.reset_index()
    return stats_df
So for example:
stats_df = compute_distrib(df, 'sample_mean')
stats_df.head(2)
+---+---------------+-----------+----------+----------+
| | sample_median | frequency | pdf | cdf |
+---+---------------+-----------+----------+----------+
| 0 | 1 | 4317 | 0.143575 | 0.143575 |
| 1 | 2 | 10169 | 0.338200 | 0.481775 |
+---+---------------+-----------+----------+----------+
Then I plot the cdf distribution this way:
ax1 = stats_df.plot(x = 'sample_mean', y = ['cdf'], grid = True)
ax1.legend(loc='best')
Goal:
I would like to plot these figures in one figure side-by-side instead of plotting separately and somehow putting them together in my slides.
You can use matplotlib.pyplot.subplots to draw multiple plots next to each other:
import matplotlib.pyplot as plt
fig, axs = plt.subplots(nrows=1, ncols=2)
# With a single row, axs is a 1-D array, so it takes one subscript.
# Pass the data you wish to plot.
axs[0].hist(...)
axs[1].plot(...)
plt.show()
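Applied to the question, a minimal sketch could look like the following. It reuses the `compute_distrib` function from the question and passes each subplot's Axes to the corresponding plotting call. The function name `plot_side_by_side` and the small data frame are invented for the example.

```python
import pandas as pd

def compute_distrib(df, col):
    """Frequency, pdf and cdf of the distinct values in df[col] (as in the question)."""
    stats_df = df.groupby(col)[col].agg('count').pipe(pd.DataFrame).rename(columns={col: 'frequency'})
    stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])
    stats_df['cdf'] = stats_df['pdf'].cumsum()
    return stats_df.reset_index()

def plot_side_by_side(df, col, bins=50):
    """Histogram on the left, cdf on the right, in a single figure."""
    import matplotlib.pyplot as plt   # imported here so compute_distrib works without matplotlib
    fig, (ax_hist, ax_cdf) = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
    ax_hist.hist(df[col], bins=bins)
    ax_hist.set_xlabel(col)
    ax_hist.set_ylabel('Frequency')
    # DataFrame.plot accepts an existing Axes through the ax argument.
    compute_distrib(df, col).plot(x=col, y='cdf', grid=True, ax=ax_cdf)
    ax_cdf.legend(loc='best')
    fig.tight_layout()
    return fig

# Made-up values, only to exercise the functions.
df = pd.DataFrame({"sample_median": [1, 2, 2, 5, 5, 5, 5, 5]})
stats_df = compute_distrib(df, "sample_median")
print(stats_df)
```

`plot_side_by_side(df, "sample_median")` returns the figure, so it can be shown with `plt.show()` or saved with `fig.savefig(...)` for the slides.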

Delete and add column to pandas dataframe - Python 3.x

I am trying to accomplish something I thought would be easy: Take three columns from my dataframe, use a label encoder to encode them, and simply replace the current values with the new values.
I have a dataframe that looks like this:
| Order_Num | Part_Num | Site | BUILD_ID |
| MO100161015 | PPT-100K39 | BALT | A001 |
| MO100203496 | MDF-925R36 | BALT | A001 |
| MO100203498 | PPT-825R34 | BALT | A001 |
| MO100244071 | MDF-323DCN | BALT | A001 |
| MO100244071 | MDF-888888 | BALT | A005 |
I am essentially trying to use sklearn's LabelEncoder() to switch my String variables to numeric. Currently, I have a function str_to_num where I feed it a column and it returns me an array (column) of the converted data. It works great.
However, I am struggling to remove the old data from my dataframe and add it to the new. My script is below:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import pandas as pd
import numpy as np
# Convert the passed in column
def str_to_num(arr):
    le = preprocessing.LabelEncoder()
    array_of_parts = []
    for x in arr:
        array_of_parts.append(x)
    new_arr = le.fit_transform(array_of_parts)
    return new_arr
# read in data from csv
data = pd.read_csv('test.csv')
print(data)
# Create the new data
converted_column = str_to_num(data['Order_Num'])
print(converted_column)
# How can I replace data['Order_Num'] with the values in converted_column?
# Drop the old data
dropped = data.drop('Order_Num', axis=1)
# Add the new_data column to the place where the old data was?
Given my current script, how can I replace the values in the 'Order_Num' column with those in converted_column? I have tried pandas.DataFrame.replace, but that replaces specific values, and I don't know how to map that to the returned data.
I would hope my expected data to be:
| Order_Num | Part_Num | Site | BUILD_ID |
| 0 | PPT-100K39 | BALT | A001 |
| 1 | MDF-925R36 | BALT | A001 |
| 2 | PPT-825R34 | BALT | A001 |
| 3 | MDF-323DCN | BALT | A001 |
| 3 | MDF-888888 | BALT | A005 |
My python --version returns
3.6.7
The beauty of pandas is sometimes understated - often you only need to do something like this:
data['Order_Num'] = str_to_num(data['Order_Num'])
There's also the option of df.apply()
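For several columns at once, the same assignment can be done in a loop. The sketch below swaps in pandas' own categorical codes instead of sklearn's LabelEncoder, which is a different technique but also numbers the distinct values in sorted order. The helper name `encode_columns` is made up, and the data is copied from the table in the question.

```python
import pandas as pd

def encode_columns(df, columns):
    """Return a copy of df where each listed column is replaced by integer codes."""
    out = df.copy()
    for c in columns:
        # Categories are sorted, so codes follow the sorted order of the distinct values.
        out[c] = out[c].astype("category").cat.codes
    return out

data = pd.DataFrame({
    "Order_Num": ["MO100161015", "MO100203496", "MO100203498", "MO100244071", "MO100244071"],
    "Part_Num": ["PPT-100K39", "MDF-925R36", "PPT-825R34", "MDF-323DCN", "MDF-888888"],
    "Site": ["BALT", "BALT", "BALT", "BALT", "BALT"],
    "BUILD_ID": ["A001", "A001", "A001", "A001", "A005"],
})
encoded = encode_columns(data, ["Order_Num", "Site", "BUILD_ID"])
print(encoded)
```

The original frame is left untouched, so the string values remain available if the codes need to be mapped back later.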

Plotting a line plot with error bars and datapoints from a pandas DataFrame

I've been racking my brain to try to figure out how to plot a pandas DataFrame the way I want but to no avail.
The DataFrame has a MultiIndex and it looks like this:
+-----------+--------------+------------+--------------+-----------------+---------+---------+---------+---------+---------+
| | | | | | run_001 | run_002 | run_003 | run_004 | run_005 |
+-----------+--------------+------------+--------------+-----------------+---------+---------+---------+---------+---------+
| file_type | server_count | file_count | thread_count | cacheclear_type | | | | | |
+-----------+--------------+------------+--------------+-----------------+---------+---------+---------+---------+---------+
| gor | 01servers | 05files | 20threads | ccALWAYS | 15.918 | 16.275 | 15.807 | 17.781 | 16.233 |
| gor | 01servers | 10files | 20threads | ccALWAYS | 17.322 | 17.636 | 16.096 | 16.484 | 16.715 |
| gor | 01servers | 15files | 20threads | ccALWAYS | 19.265 | 17.128 | 17.630 | 18.739 | 16.833 |
| gor | 01servers | 20files | 20threads | ccALWAYS | 23.744 | 20.539 | 21.416 | 22.921 | 22.794 |
+-----------+--------------+------------+--------------+-----------------+---------+---------+---------+---------+---------+
What I want to do is plot a line graph where the x values are the 'file_count' value, and the y value for each is the average of all the run_xxx values for the corresponding line in the DataFrame.
If possible I would like to add error bars and even the data points themselves so that I can see the distribution of the data behind that average.
Here's a (crappy) mockup of roughly what I'm talking about:
I've been able to create a boxplot using the boxplot() function built into pandas' DataFrame by doing:
df.transpose().boxplot()
This looks almost okay but a little bit cluttered and doesn't have the actual data points plotted.
A beeswarm plot works very nicely in this situation, especially when you have a lot of dots and want to show their distribution. You do, however, need to supply the positions parameter to beeswarm, because by default it starts at 0, whereas the boxplot method of a pandas DataFrame plots boxes at x = 1, 2, ...
It comes down to just this:
import numpy as np
from beeswarm import *
D1 = beeswarm(df.values, positions = np.arange(len(df.values))+1)
D2 = df.transpose().boxplot(ax=D1[1])
For completeness I'll include the way I finally managed to do this here:
import numpy as np
import matplotlib.pyplot as plt
import random
# sortlevel() was removed from pandas; sort_index(level=...) is the replacement
dft = df.sort_index(level=2).transpose()
fig, ax = plt.subplots()
x = []
y = []
y_err = []
scatterx = []
scattery = []
for n, col in enumerate(dft.columns):
    x.append(n)
    y.append(np.mean(dft[col]))
    y_err.append(np.std(dft[col]))
    for v in dft[col]:
        scattery.append(v)
        # small random horizontal jitter so overlapping points stay visible
        scatterx.append(n + ((random.random()-0.5)*0.05))
p = plt.plot(x, y, label="mean of runs")
color = p[0].get_color()
plt.errorbar(x, y, yerr=y_err, fmt="none", ecolor=color)
plt.scatter(scatterx, scattery, alpha=0.3, color=color)
plt.legend(loc=2)
ax.set_xticks(range(len(dft.columns)))
# each column label is a MultiIndex tuple; element 2 is the file_count level
ax.set_xticklabels([c[2] for c in dft.columns])
plt.show()
This will show a line chart with error bars and data points. There may be some errors in the above code. I copied it and simplified a bit before pasting here.
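The per-row mean and spread can also be computed with pandas directly, without the explicit loop. The sketch below rebuilds the four rows from the question and assumes a population standard deviation (ddof=0). The function name `run_stats` is invented for this illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("gor", "01servers", f, "20threads", "ccALWAYS")
     for f in ["05files", "10files", "15files", "20files"]],
    names=["file_type", "server_count", "file_count", "thread_count", "cacheclear_type"],
)
df = pd.DataFrame(
    [[15.918, 16.275, 15.807, 17.781, 16.233],
     [17.322, 17.636, 16.096, 16.484, 16.715],
     [19.265, 17.128, 17.630, 18.739, 16.833],
     [23.744, 20.539, 21.416, 22.921, 22.794]],
    index=index,
    columns=["run_001", "run_002", "run_003", "run_004", "run_005"],
)

def run_stats(df, level="file_count"):
    """Mean and population std across the run_xxx columns, indexed by one index level."""
    stats = pd.DataFrame({"mean": df.mean(axis=1), "std": df.std(axis=1, ddof=0)})
    stats.index = df.index.get_level_values(level)
    return stats

stats = run_stats(df)
print(stats)
```

The `mean` and `std` columns can then be passed to `plt.errorbar(range(len(stats)), stats["mean"], yerr=stats["std"])`, with `stats.index` as the tick labels.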
