plotting categorical data by row from a dataframe - python

I have some data showing a machine's performance. One of the columns records when the pipe the machine makes fails a particular quality check, causing the machine to automatically cut the pipe. Depending on the machine and the way it's set up, this happens around 1% of the time. I am trying to make a plot that shows the failure rate against time; my theory is that the longer some of the tools have been in use, the more failures they produce.
Here is an example of the excel file the machine makes every 24 hours.
The column "Cut Event" is the one I am interested in. In the snip the "/" symbol indicates no cut was made, when a cut is made it the cell in that column will say "speed", "ovality" or "thickness" as a reason (in German). What I want to do I go through a dataframe and only capture rows that have a failure, i.e. not a forward slash.
Here is what I have from reading through SO and other tutorials. The machine "speaks" German, btw, hence the longer words:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#fig = plt.gcf()
df = pd.read_excel("W03 tool with cuts and dates.xlsx", dtype=object)
df = df[['Time', 'Cut_Event']]
# use df.loc[row_mask, column] = value rather than chained indexing,
# which triggers SettingWithCopyWarning and may not write back
df.loc[df['Cut_Event'] == 'Geschwindigkeitsschwankung', 'Cut_Event'] = 'Speed Cut Event'
df.loc[df['Cut_Event'] == 'Kugelfehler', 'Cut_Event'] = 'Kugel Cut Event'
df.loc[df['Cut_Event'] == '/', 'Cut_Event'] = 'No Cut Event'
print(df)
What I am stuck on is passing these events over to be plotted. My Python learning so far has covered plotting everything in a particular column of a numerical dataframe, rather than just specific events in categorical data, and I am getting errors as a result. I tried seaborn but got nowhere.
All help genuinely appreciated.
edit: Adding the dataset
Datum   WKZ_code  Time      Rad_t1  Not Important  Cut_Event
10 Sep  W03       00:00:00  100     250            /
10 Sep  W03       00:00:01  100     250            /
10 Sep  W03       00:00:02  100     250            /
10 Sep  W03       00:00:03  100     250            /
10 Sep  W03       00:00:04  100     250            /
10 Sep  W03       00:00:00  100     250            Speed Cut
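A minimal sketch of one way to get there (my own assumption about the flow, not tested against the real file): parse the times, drop the 'No Cut Event' rows, then count cut events per hour and plot that against time.
import pandas as pd
import matplotlib.pyplot as plt

# assumes df holds the 'Time' and 'Cut_Event' columns relabelled above,
# with Time read in as hh:mm:ss text (adjust if Excel hands back time objects)
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S')

# keep only the rows where the machine actually cut the pipe
cuts = df[df['Cut_Event'] != 'No Cut Event']

# count cut events per hour and plot them against time
counts = cuts.set_index('Time')['Cut_Event'].resample('1H').count()
counts.plot(kind='bar')
plt.ylabel('Cut events per hour')
plt.show()
Dividing counts by the total number of rows in each hour would turn the raw count into the failure rate the question asks about.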

Related

Mapping data frame descriptions based on values of multiple columns

I need to generate a mapping dataframe with each unique code and the description I want prioritised, but need to do it based on a set of prioritisation options. So for example the starting dataframe might look like this:
  Filename        TB Period   Company  Code           Desc.  Amount
0 3 - Foxtrot...  Prior TB    FOXTROT  FOXTROT__1000  98     100
1 3 - Foxtrot...  Prior TB    FOXTROT  FOXTROT__1000  7      200
2 3 - Foxtrot...  Opening TB  FOXTROT  FOXTROT__1000  ZX     -100
3 3 - Foxtrot...  Closing TB  FOXTROT  FOXTROT__1000  29     -200
4 3 - Foxtrot...  Prior TB    FOXTROT  FOXTROT__1001  BA     100
5 3 - Foxtrot...  Opening TB  FOXTROT  FOXTROT__1001  9      200
6 3 - Foxtrot...  Closing TB  FOXTROT  FOXTROT__1001  ARC    -100
7 3 - Foxtrot...  Closing TB  FOXTROT  FOXTROT__1001  86     -200
The options I have for prioritisation of descriptions are:
First, search for viable options by Period: Closing first, then Opening if nothing is found, then Prior.
If multiple descriptions fall in the prioritised period, prioritise either the longest string or the first instance.
So for example, if I wanted prioritisation of Closing, then Opening, then Prior, with longest string, I should get a mapping dataframe that looks like this:
Code           New Desc.
FOXTROT__1000  29
FOXTROT__1001  ARC
Just for context, I have a fairly simple way to do all this in tkinter, but it's dependent on generating a GUI of inconsistent codes and comboboxes of their descriptions, which is then used to generate a mapping dataframe.
The issue is that for large volumes (>1,000 and up to 30,000 inconsistent codes) it becomes impractical to generate a GUI, so for large volumes I need a way to auto-generate the mapping dataframe directly from the initial data while circumventing tkinter entirely.
import numpy as np
import pandas as pd

# Create a new column which encodes the hierarchy given the value of Period
df['NewFilterColumn'] = np.where(df['Period'] == 'Closing', 1,
                         np.where(df['Period'] == 'Opening', 2,
                          np.where(df['Period'] == 'Prior', 3, None)))

df = df.sort_values(by=['NewFilterColumn', 'Code', 'Desc.'], ascending=True)
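From there, one hedged way to finish (a sketch assuming the column names shown above, reusing the NewFilterColumn built by the np.where): rank rows by period priority and description length within each Code, then keep the first row per Code.
# lower NewFilterColumn = higher-priority period (from the np.where above);
# break ties by description length, longest first
df['DescLen'] = df['Desc.'].astype(str).str.len()
df = df.sort_values(['Code', 'NewFilterColumn', 'DescLen'],
                    ascending=[True, True, False])

# the first row per Code is now the prioritised description
mapping = (df.drop_duplicates('Code')[['Code', 'Desc.']]
             .rename(columns={'Desc.': 'New Desc.'})
             .reset_index(drop=True))
print(mapping)
For "first instance" instead of "longest", drop the DescLen sort key; drop_duplicates keeps the first occurrence in sort order.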

Pandas dataframe interpolation with logarithmically sampled time intervals

I have a pandas dataframe with columns 'Time' in minutes and 'Value' pulled in from a data logger. The data are logged in logarithmic time intervals, meaning that the first values are logged in fractional minutes then as time proceeds the time intervals get longer:
print(df)
Minutes Value
0 0.001 0.00100
1 0.005 0.04495
2 0.010 0.04495
3 0.015 0.09085
4 0.020 0.11368
.. ... ...
561 4275.150 269.17782
562 4285.150 266.90964
563 4295.150 268.35306
564 4305.150 269.42984
565 4315.150 268.37594
I would like to linearly interpolate the 'Value' at one minute intervals from 0 to 4315 minutes.
I have attempted a few different iterations of df.interpolate(), but have not found success. Can someone please help me out? Thank you
I think it's possible that my question was very basic, or I asked a confusing question. Either way, I just wrote a little loop to solve my problem and felt like I should share it. I am sure this is not the most efficient way of doing what I was asking, and hopefully somebody can suggest better ways of accomplishing this. I am still very new at this whole thing.
First a few qualifying things:
The 'Value' data that I was talking about is called 'drawdown', which refers to a difference in water level from the initial starting water level inside a water well. It starts at 0.
This kind of data is often viewed in a semi-log plot, and sometimes it's easier to replace 0 with a very low number (e.g. 0.0001) so that it plots easily in other programs.
This code takes a .csv file with column names 'Minutes' and 'Drawdown' and compares the time values with a new reference dataframe of minutes from 0 through the end of the dataset. For each desired time value it finds the 2 closest logged times, takes a weighted average of their values, and writes a new csv of integer minutes with drawdown.
Cheers!
# -*- coding: utf-8 -*-
"""
Created on Tue Sep 22 13:42:29 2020
@author: cmeyer
"""
import pandas as pd
import numpy as np

df = pd.read_csv('Read_in.csv')
length = len(df) - 1
last = df.at[length, 'Drawdown']
lengthpump = int(df.at[length, 'Minutes'])
minutes = np.arange(0, lengthpump, 1)
dfminutes = pd.DataFrame(minutes)
dfminutes.columns = ['Minutes']

for i in range(1, lengthpump, 1):
    non_uni_minutes = df['Minutes']
    uni_minutes = dfminutes.at[i, 'Minutes']
    # the two logged times closest to the target minute i
    close1 = non_uni_minutes[np.argsort(np.abs(non_uni_minutes - uni_minutes))[0]]
    close2 = non_uni_minutes[np.argsort(np.abs(non_uni_minutes - uni_minutes))[1]]
    index1 = np.where(non_uni_minutes == close1)
    index1 = int(index1[0])
    index2 = np.where(non_uni_minutes == close2)
    index2 = int(index2[0])
    num1 = df.at[index1, 'Drawdown']
    num2 = df.at[index2, 'Drawdown']
    # inverse-distance weights for the two neighbouring samples
    weight1 = 1 - abs((i - close1) / i)
    weight2 = 1 - abs((i - close2) / i)
    Value = (weight1 * num1 + weight2 * num2) / (weight1 + weight2)
    dfminutes.at[i, 'Drawdown'] = Value

dfminutes.at[0, 'Drawdown'] = 0.000001
dfminutes.at[0, 'Minutes'] = 0.000001
dfminutes.to_csv('integer_minutes_drawdown.csv')
Here I implemented an efficient solution using numpy.interp. I've coded a somewhat fancy way of reading the data into a pandas.DataFrame from a string; you may use any simpler way that suits your needs, like pandas.read_csv(...). Try the code below:
import math
import numpy as np
import pandas as pd

# Just a fancy way of reading the data; use any other method of reading instead.
# list(...) materialises each row for the DataFrame constructor.
df = pd.DataFrame([list(map(float, line.split())) for line in """
0.001 0.00100
0.005 0.04495
0.010 0.04495
0.015 0.09085
0.020 0.11368
4275.150 269.17782
4285.150 266.90964
4295.150 268.35306
4305.150 269.42984
4315.150 268.37594
""".splitlines() if line.strip()], columns=['Time', 'Value'])
a = df.values
# Create array of integer x = [0 1 2 3 ... LastTimeFloor].
x = np.arange(math.floor(a[-1, 0] + 1e-6) + 1)
# Linearly interpolate
y = np.interp(x, a[:, 0], a[:, 1])
df = pd.DataFrame({'Time': x, 'Value': y})
print(df)
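A pandas-native variant of the same idea, offered as a sketch (assuming the Minutes/Value columns from the question): df.interpolate() on its own only fills gaps at existing rows, so the one-minute grid first has to be inserted into the index; interpolate(method='index') then interpolates linearly against the index values.
import numpy as np
import pandas as pd

# assumes df has the 'Minutes' and 'Value' columns shown in the question
s = df.set_index('Minutes')['Value']
grid = np.arange(0, 4316)  # one-minute grid, 0..4315

# insert the grid points into the index, interpolate on the index values,
# then keep only the grid points
out = (s.reindex(s.index.union(grid))
        .interpolate(method='index')
        .loc[grid])
Note that minute 0 lies before the first sample (0.001 minutes), so it stays NaN here, whereas np.interp clamps to the first value.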

How to create visualization from time series data in a .txt file in python

I have a .txt file with three columns: time, ticker, price. The time is spaced in 15 second intervals. This is what it looks like loaded into a Pandas DataFrame in a Jupyter notebook.
time ticker price
0 09:30:35 EV 33.860
1 00:00:00 AMG 60.430
2 09:30:35 AMG 60.750
3 00:00:00 BLK 455.350
4 09:30:35 BLK 451.514
... ... ... ...
502596 13:00:55 TLT 166.450
502597 13:00:55 VXX 47.150
502598 13:00:55 TSLA 529.800
502599 13:00:55 BIDU 103.500
502600 13:00:55 ON 12.700
# NOTE: the 00:00:00 rows carry each ticker's value at market open;
# they appear only alongside the first (09:30:35) set of data points.
I need to create a function that takes an input (a ticker) and creates a bar chart that displays the data with 5 minute ticks (the data is every 20 seconds, so one tick for every 15 points in time).
So far I've thought about separating the "mm" part of the hh:mm:ss to get just the minutes in another column, and then writing a for loop that looks something like this:
for num in df['mm']:
    if num % 5 == 0:
        print('tick')
then somehow appending the "tick" to the "time" column for every 5 minutes of data (I'm not sure how I would do this), then using the time column as the index and only using data with the "tick" index in it (some kind of if statement). I'm not sure if this makes sense but I'm drawing a blank on this.
You should have a look at the built-in functions in pandas. In the following example I'm using a date + time format but it shouldn't be hard to convert one to the other.
Generate data
%matplotlib inline
import pandas as pd
import numpy as np

dates = pd.date_range(start="2020-04-01", periods=150, freq="20S")
df1 = pd.DataFrame({"date": dates,
                    "price": np.random.rand(len(dates))})
df2 = df1.copy()
df1["ticker"] = "a"
df2["ticker"] = "b"
df = pd.concat([df1, df2], ignore_index=True)
df = df.sample(frac=1).reset_index(drop=True)
Resample Timeseries every 5 minutes
Here you can try to see the output of
df1.set_index("date")\
   .resample("5T")\
   .first()\
   .reset_index()
Here we take just the first element in each 5-minute bucket (05:00, 10:00, and so on). To do the same for every ticker, we need a groupby:
out = df.groupby("ticker")\
        .apply(lambda x: x.set_index("date")\
                          .resample("5T")\
                          .first()\
                          .reset_index())\
        .reset_index(drop=True)
Plot function
def plot_tick(data, ticker):
    ts = data[data["ticker"] == ticker].reset_index(drop=True)
    ts.plot(x="date", y="price", kind="bar", title=ticker)

plot_tick(out, "a")
Then you can improve the plot or try plotly.
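For the date + time conversion mentioned at the start of this answer, one possibility (a sketch assuming the time strings look like those in the question) is to let pd.to_datetime attach a default date, which is all resample needs:
import pandas as pd

# parses '09:30:35' into a full timestamp on a default date (1900-01-01)
df["date"] = pd.to_datetime(df["time"], format="%H:%M:%S")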

Fixing a TypeError when using pandas after replacing a string with a floating point number to eventually generate a bar graph of monthly total precip?

I am new to python and pandas, and I need help creating a bar graph after changing the string 'T' to 0.005 inches of rain in a daily precip column. Below is some sample csv data:
Date HighT Avgt LowT Precip
2017-01-01 46 35 24 T
2017-01-02 54 48 41 0
2017-01-03 54 45 34 0.33
2017-01-04 30 24 19 0.36
The csv file is daily weather data for the year 2017. I have daily precipitation amounts and replaced 'T' (which means trace amounts) with 0.005 using pandas. I converted the dates from the csv file to use as my index.
%matplotlib inline
import pandas as pd
import numpy as np
import csv
import matplotlib.pyplot as plt
wt = pd.read_csv('CSV/2017-weather.csv')
wt['Date'] = pd.to_datetime(wt['Date'], format = '%m/%d/%y')
wt.index = wt['Date']
del wt['Date']
wt.loc[wt['Precip']=='T', ['Precip']] = 0.005
In attempt to plot the bar graph I first tried this:
wt.groupby(wt.index.month)['Precip'].sum().plot(kind='bar')
However, I keep getting a TypeError saying: "Can't convert 'float' object to str implicitly."
I then tried making the 'Precip' column its own data frame, converting it to all floating points, and then tried to make the bar graph again:
wt2 = wt[['Precip']]
wt2.astype(float)
wt2.groupby(wt2.index.month)['Precip'].sum().plot(kind='bar')
I still get the same error I mentioned above and I'm not sure how to fix the error at this point.
Any help would be appreciated!
If you look at the first row of the data you provide, you can see that the Precip value is "T". As this isn't a number, it can't be converted to a float, so the TypeError is thrown.
Try removing that first row and your code should work.
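Alternatively, since the 'T' values have already been replaced with 0.005, note that astype returns a new object rather than converting in place; a sketch of the missing step, assuming the replacement code above has run:
# assign the converted column back; wt2.astype(float) alone discards the result
wt['Precip'] = wt['Precip'].astype(float)
wt.groupby(wt.index.month)['Precip'].sum().plot(kind='bar')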

Add two Pandas Series or DataFrame objects in-place?

I have a dataset where we record the electrical power demand from each individual appliance in the home. The dataset is quite large (2 years of data; 1 sample every 6 seconds; 50 appliances). The data is in a compressed HDF file.
We need to add the power demand for every appliance to get the total aggregate power demand over time. Each individual meter might have a different start and end time.
The naive approach (using a simple model of our data) is to do something like this:
LENGTH = 2**25
N = 30
cumulator = pd.Series(dtype=float)
for i in range(N):
    # change the index for each new_entry to mimic the fact
    # that our appliance meters have different start and end times
    new_entry = pd.Series(1, index=np.arange(i, LENGTH + i))
    cumulator = cumulator.add(new_entry, fill_value=0)
This works fine for small amounts of data. It also works OK with large amounts of data as long as every new_entry has exactly the same index.
But, with large amounts of data, where each new_entry has a different start and end index, Python quickly gobbles up all the available RAM. I suspect this is a memory fragmentation issue. If I use multiprocessing to fire up a new process for each meter (to load the meter's data from disk, load the cumulator from disk, do the addition in memory, then save the cumulator back to disk, and exit the process) then we have fine memory behaviour but, of course, all that disk IO slows us down a lot.
So, I think what I want is an in-place Pandas add function. The plan would be to initialise cumulator to have an index which is the union of all the meters' indices, then allocate memory once for that cumulator. Hence no more fragmentation issues.
I have tried two approaches but neither is satisfactory.
I tried using numpy.add to allow me to set the out argument:
# Allocate enough space for the cumulator
cumulator = pd.Series(0, index=np.arange(0, LENGTH + N))
for i in range(N):
    new_entry = pd.Series(1, index=np.arange(i, LENGTH + i))
    cumulator, aligned_new_entry = cumulator.align(new_entry, copy=False, fill_value=0)
    del new_entry
    np.add(cumulator.values, aligned_new_entry.values, out=cumulator.values)
    del aligned_new_entry
But this gobbles up all my RAM too and doesn't seem to do the addition. If I change the penultimate line to cumulator.values = np.add(cumulator.values, aligned_new_entry.values, out=cumulator.values) then I get an error about not being able to assign to cumulator.values.
This second approach appears to have the correct memory behaviour but is far too slow to run:
for i in range(N):
    new_entry = pd.Series(1, index=np.arange(i, LENGTH + i))
    for index in cumulator.index:
        try:
            cumulator[index] += new_entry[index]
        except KeyError:
            pass
I suppose I could write this function in Cython. But I'd rather not have to do that.
So: is there any way to do an 'inplace add' in Pandas?
Update
In response to comments below, here is a toy example of our meter data and the sum we want. All values are watts.
time meter1 meter2 meter3 sum
09:00:00 10 10
09:00:06 10 20 30
09:00:12 10 20 30
09:00:18 10 20 30 50
09:00:24 10 20 30 50
09:00:30 10 30 40
If you want to see more details then here's the file format description of our data logger, and here's the 4TByte archive of our entire dataset.
After messing around a lot with multiprocessing, I think I've found a fairly simple and efficient way to do an in-place add without using multiprocessing:
import numpy as np
import pandas as pd

LENGTH = 2**26
N = 10
DTYPE = np.int64

# Allocate memory *once* for a Series which will hold our cumulator
cumulator = pd.Series(0, index=np.arange(0, N + LENGTH), dtype=DTYPE)

# Get a writable numpy view of the Series' underlying buffer
# (on older pandas this was np.frombuffer(cumulator.data, dtype=DTYPE))
cumulator_arr = cumulator.values

# Create lots of dummy data. Each new_entry has a different start
# and end index.
for i in range(N):
    new_entry = pd.Series(1, index=np.arange(i, LENGTH + i), dtype=DTYPE)
    aligned_new_entry = np.pad(new_entry.values, pad_width=(i, N - i),
                               mode='constant', constant_values=(0, 0))
    # np.pad could be replaced by new_entry.reindex(index, fill_value=0)
    # but np.pad is faster and more memory efficient than reindex
    del new_entry
    np.add(cumulator_arr, aligned_new_entry, out=cumulator_arr)
    del aligned_new_entry

del cumulator_arr
print(cumulator.head(N * 2))
which prints:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 10
11 10
12 10
13 10
14 10
15 10
16 10
17 10
18 10
19 10
Assuming that your dataframe looks something like:
df.index.names == ['time']
df.columns == ['meter1', 'meter2', ..., 'meterN']
then all you need to do is:
df['total'] = df.fillna(0).sum(axis=1)
(Note that fillna(0, inplace=True) returns None, so it cannot be chained into sum.)
