The code below returns a blank plot in Python:
# import libraries
import pandas as pd
import os
import matplotlib.pyplot as plt
import numpy as np
os.chdir('file path')
# import data files
activity = pd.read_csv(r'file path\dailyActivity_merged.csv')   # raw strings so the backslashes are not treated as escapes
intensity = pd.read_csv(r'file path\hourlyIntensities_merged.csv')
steps = pd.read_csv(r'file path\hourlySteps_merged.csv')
sleep = pd.read_csv(r'file path\sleepDay_merged.csv')
# ActivityDate in activity df only includes dates (no time). Rename it Dates
activity = activity.rename(columns={'ActivityDate': 'Dates'})
# ActivityHour in intensity df and steps df includes date-time. Split date-time column into dates and times in intensity. Drop the date-time column
intensity['Dates'] = pd.to_datetime(intensity['ActivityHour']).dt.date
intensity['Times'] = pd.to_datetime(intensity['ActivityHour']).dt.time
intensity = intensity.drop(columns=['ActivityHour'])
# split date-time column into dates and times in steps. Drop the date-time column
steps['Dates'] = pd.to_datetime(steps['ActivityHour']).dt.date
steps['Times'] = pd.to_datetime(steps['ActivityHour']).dt.time
steps = steps.drop(columns=['ActivityHour'])
# split date-time column into dates and times in sleep. Drop the date-time column
sleep['Dates'] = pd.to_datetime(sleep['SleepDate']).dt.date
sleep['Times'] = pd.to_datetime(sleep['SleepDate']).dt.time
sleep = sleep.drop(columns=['SleepDate', 'TotalSleepRecords'])
# add a column & calculate time_awake_in_bed before falling asleep
sleep['time_awake_in_bed'] = sleep['TotalTimeInBed'] - sleep['TotalMinutesAsleep']
# merge activity and sleep
merge_cols = ['Id', 'Dates']  # avoid naming a variable `list`; it shadows the built-in
activity_sleep = sleep.merge(activity,
                             on=merge_cols,
                             how='outer')
# plot relation between calories used daily vs how long it takes users to fall asleep
plt.scatter(activity_sleep['time_awake_in_bed'], activity_sleep['Calories'], s=20, c='b', marker='o')
plt.axis([0, 200, 0, 5000])
plt.show()
NOTE: max(Calories) = 4900 and min(Calories) = 0. max(time_awake_in_bed) = 150 and min(time_awake_in_bed) = 0.
Please let me know how I can get a scatter plot out of this. Thank you in advance for any help.
The same variables from the same data-frame work perfectly with geom_point() in R.
I found where the problem was. As @Redox and @cheersmate mentioned in the comments, the data frame that I created by merging included NaN values. I fixed this by merging only on 'Id'. Then I could create a scatter plot:
merge_cols = ['Id']
activity_sleep = sleep.merge(activity,
                             on=merge_cols,
                             how='outer')
The column "Dates" is not a good one to merge on, as in each data frame the same dates are repeated across multiple rows. I also noticed that I get the same plot whether I use an outer or an inner merge. Thank you.
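To make the failure mode concrete: an outer merge pads unmatched keys with NaN, and a row that is NaN in either plotted column contributes nothing to the scatter. A minimal sketch with made-up toy values (not the real Fitbit export):

```python
import pandas as pd

# toy frames standing in for the real sleep and activity data (values are illustrative)
sleep = pd.DataFrame({'Id': [1, 2],
                      'Dates': ['4/12/2016', '4/13/2016'],
                      'time_awake_in_bed': [30, 45]})
activity = pd.DataFrame({'Id': [1, 3],
                         'Dates': ['4/12/2016', '4/12/2016'],
                         'Calories': [1985, 1797]})

# an outer merge keeps unmatched rows and fills their missing columns with NaN;
# if the merge keys never line up (e.g. string dates vs. date objects),
# every row ends up NaN in one of the plotted columns and the scatter is blank
merged = sleep.merge(activity, on=['Id', 'Dates'], how='outer')

# dropping those rows leaves only fully matched observations, which is why
# an inner merge produces the same plot as an outer merge here
plotable = merged.dropna(subset=['time_awake_in_bed', 'Calories'])
```

Plotting `plotable` instead of `merged` (or merging with `how='inner'`) avoids the blank plot when most keys fail to match.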
I have passed my time series data, which is essentially pressure measurements from a sensor, through a Fourier transform, similar to what is described in https://towardsdatascience.com/fourier-transform-for-time-series-292eb887b101.
The file used can be found here:
https://docs.google.com/spreadsheets/d/1MLETSU5Trl5gLGO6pv32rxBsR8xZNkbK/edit?usp=sharing&ouid=110574180158524908052&rtpof=true&sd=true
The related code is this:
import pandas as pd
import numpy as np
file='test.xlsx'
df=pd.read_excel(file,header=0)
#df=pd.read_csv(file,header=0)
df.head()
df.tail()
# drop ID
df=df[['JSON_TIMESTAMP','ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_AVERAGE_ADH_WEIGHT_FB','ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_ADH_COATWEIGHT_SP']]
# extract year month
df["year"] = df["JSON_TIMESTAMP"].str[:4]
df["month"] = df["JSON_TIMESTAMP"].str[5:7]
df["day"] = df["JSON_TIMESTAMP"].str[8:10]
df = df.sort_values(['year', 'month', 'day'], ascending=[True, True, True])
df['JSON_TIMESTAMP'] = df['JSON_TIMESTAMP'].astype('datetime64[ns]')
df = df.sort_values(by='JSON_TIMESTAMP', ascending=True)  # reassign, or the sort has no effect
df1=df.copy()
df1 = df1.set_index('JSON_TIMESTAMP')
df1 = df1[["ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_AVERAGE_ADH_WEIGHT_FB"]]
import matplotlib.pyplot as plt
#plt.figure(figsize=(15,7))
plt.rcParams["figure.figsize"] = (25,8)
df1.plot()
#df.plot(style='k. ')
plt.show()
df1.hist(bins=20)
from scipy.fft import rfft,rfftfreq
## https://towardsdatascience.com/fourier-transform-for-time-series-292eb887b101
# convert into x and y
x = list(range(len(df1.index)))
y = df1['ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_AVERAGE_ADH_WEIGHT_FB']
# apply the fast Fourier transform to the series itself; passing the whole
# one-column DataFrame would transform along the wrong (length-1) axis
f = np.fft.fft(y)
# get the list of frequencies
num=np.size(x)
freq = [i / num for i in list(range(num))]
# get the list of spectrums
spectrum=f.real*f.real+f.imag*f.imag
nspectrum=spectrum/spectrum[0]
# plot nspectrum per frequency, with a semilog scale on nspectrum
plt.semilogy(freq,nspectrum)
# ensure plain 1-D numpy arrays before building the results frame
freq = np.array(freq)
nspectrum = nspectrum.flatten()
# improve the plot by adding periods in number of days rather than frequency
results = pd.DataFrame({'freq': freq, 'nspectrum': nspectrum})
results['period'] = results['freq'] / (1/365)
plt.semilogy(results['period'], results['nspectrum'])
# improve the plot by converting the data into groups per day to avoid peaks
results['period_round'] = results['period'].round()
grouped_day = results.groupby('period_round')['nspectrum'].sum()
plt.semilogy(grouped_day.index, grouped_day)
#plt.xticks([1, 13, 26, 39, 52])
My end result is this:
Result of Fourier Transformation for Data
My question is: what does this eventually show for our data, and intuitively, what does the spike in the last section mean? What can I do with such a result?
Thanks in advance all!
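One way to build intuition about what a spectrum spike means (an editor's illustration, not from the original post): run the same FFT pipeline on a synthetic signal with a known period, and the dominant peak lands exactly at that signal's frequency.

```python
import numpy as np

# synthetic signal: a sine with a known 50-sample period plus a little noise
n = 1000
t = np.arange(n)
rng = np.random.default_rng(0)
y = np.sin(2 * np.pi * t / 50) + 0.1 * rng.normal(size=n)

# same steps as in the question: FFT, power spectrum, frequency axis
f = np.fft.fft(y)
spectrum = f.real**2 + f.imag**2
freq = np.arange(n) / n

# the largest peak in the positive-frequency half (skipping the DC term)
half = slice(1, n // 2)
peak_freq = freq[half][np.argmax(spectrum[half])]
period = 1 / peak_freq   # recovered period, in samples
```

A spike at frequency f says the data repeats roughly every 1/f samples, so a spike in the pressure spectrum points to a periodicity in the sensor signal at that time scale.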
Just to be upfront, I am a mechanical engineer with limited coding experience, though I have some programming classes under my belt (Java, C++, and Lisp).
I have inherited this code from my predecessor and am just trying to make it work for what I'm doing. I need to iterate through an Excel file that has column A values of 0, 1, 2, and 3 (in the code below this corresponds to "Revs"), pick out all the rows where the value = 0 and put them into a separate file, and do the same again for value = 2, etc. Thank you for bearing with me; I appreciate any help I can get.
import pandas as pd
import numpy as np
import os
import os.path
import xlsxwriter
import matplotlib.pyplot as plt
import six
import matplotlib.backends.backend_pdf
from matplotlib.gridspec import GridSpec
from matplotlib.ticker import AutoMinorLocator, MultipleLocator
def CamAnalyzer(entryName):
    # read Excel data from file as a dataframe
    df = pd.read_excel(str(file_loc) + str(entryName), header=1)  # header=1 to get the correct header row
    print(df)
    # set up a grid for the plots
    plt.style.use('fivethirtyeight')
    fig = plt.figure(figsize=(17, 22))
    gs = GridSpec(3, 2, figure=fig)
    props = dict(boxstyle='round', facecolor='w', alpha=1)
    # create a list of 4 smaller dataframes by splitting df when the rev count changes, and name them
    dfSplit = list(df.groupby("Revs"))
    names = ["Air Vent", "Inlet", "Diaphram", "Outlet"]
    for x, y in enumerate(dfSplit):
        # for each smaller dataframe y (at index x), create a polar plot and assign it to a space in the grid
        dfs = y[1]
        r = dfs["Measurement"].str.strip(" in")  # the radius measurement column has units; ditch them
        r = r.apply(pd.to_numeric) + zero/2  # convert all values in the frame to floats ("zero" and "file_loc" are defined elsewhere in the inherited script)
        theta = dfs["Rads"]
        if x < 2:
            ax = fig.add_subplot(gs[1, x], polar=True)
        else:
            ax = fig.add_subplot(gs[2, x-2], polar=True)
        ax.set_rlim(0, 0.1)  # set limits on the radial axis
        ax.plot(theta, r)
        ax.grid(True)
        ax.set_title(names[x])  # nametag
    # create another subplot in the grid that overlays all 4 smaller dataframes on one plot
    ax2 = fig.add_subplot(gs[0, :], polar=True)
    ax2.set_rlim(0, 0.1)
    for x, y in enumerate(dfSplit):
        dfs = y[1]
        r = dfs["Measurement"].str.strip(" in")
        r = r.apply(pd.to_numeric) + zero/2
        theta = dfs["Rads"]
        ax2.plot(theta, r)
    ax2.set_title("Sample " + str(entryName).strip(".xlsx") + " Overlayed")
    ax2.legend(names, bbox_to_anchor=(1.1, 1.05))  # place the legend outside of the plot area
    plt.savefig(str(file_loc) + "/Results/" + str(entryName).strip(".xlsx") + ".png")
    print("Results Saved")
I'm on my phone, so I can't check exact code examples, but this should get you started.
First, most of the code you posted is about graphing, and is therefore not useful for your needs. The basic approach: use pandas (a library) to read in the Excel sheet, use the pandas function groupby to split that sheet by 'Revs', then iterate through each Rev and use pandas again to write each one back to a file. Copying the relevant sections from above:
#this brings in the necessary library
import pandas as pd
#Read excel data from file as a dataframe
#header should point to the row that describes your columns. The first row is row 0.
df = pd.read_excel("filename.xlsx", header = 1)
#create a list of 4 smaller dataframes using GroupBy.
#This returns a 'GroupBy' object.
dfSplit = df.groupby("Revs")
#iterate through the groupby object, saving each
#iterating over key (name) and value (dataframes)
#use the name to build a filename
for name, frame in dfSplit:
    frame.to_excel("Rev " + str(name) + ".xlsx")
Edit: I had a chance to test this code, and it should now work. The details will depend a little on your actual file (e.g., which row is your header row).
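If only one Rev value is needed rather than a file per group, the same GroupBy object can hand back a single group, and a plain boolean mask does the same job. A small sketch with toy values (column names assumed to match the real sheet):

```python
import pandas as pd

# toy stand-in for the inherited spreadsheet
df = pd.DataFrame({'Revs': [0, 0, 1, 2, 2, 3],
                   'Measurement': [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]})

# pull out just the rows where Revs == 2 via the GroupBy object...
rev2 = df.groupby('Revs').get_group(2)

# ...or with a plain boolean mask, which gives the same rows
rev2_mask = df[df['Revs'] == 2]
```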
I need to plot some spectral data as a 2D image, where each data point corresponds to a spectrum with a specific date/time. I need to plot all spectra as follows:
- x-axis - corresponds to the wavelength
- y-axis - corresponds to the date/time
- intensity - corresponds to the flux
If my data points were continuous/sequential in time, I would just use matplotlib's imshow. However, not only are the points not all continuous/sequential in time, but there are large time gaps between them.
Here is some simulated data that mimics what I have:
import numpy as np
sampleSize = 100
data={}
for time in np.arange(0,5):
    data[time] = np.random.sample(sampleSize)
for time in np.arange(14,20):
    data[time] = np.random.sample(sampleSize)
for time in np.arange(30,40):
    data[time] = np.random.sample(sampleSize)
for time in np.arange(25.5,35.5):
    data[time] = np.random.sample(sampleSize)
for time in np.arange(80,120):
    data[time] = np.random.sample(sampleSize)
If I needed to plot only one of the subsets of data above, I would do:
mplt.imshow([data[time] for time in np.arange(0,5)], cmap ='Greys',aspect='auto',origin='lower',interpolation="none",extent=[-50,50,0,5])
mplt.show()
However, I have no idea how I can plot all the data in the same figure while showing the gaps and keeping the y-axis as the time. Any ideas?
thanks,
Jorge
Or you can use pandas to help you with sorting the keys, then reindex:
df = pd.DataFrame(data).T
plt.imshow(df.reindex(np.arange(df.index.max() + 1)),  # +1 so the last time row is kept
cmap ='Greys',
aspect='auto',
origin='lower',
interpolation="none",
extent=[-50,50,0,5])
Output:
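Filling in the pieces around that snippet, a self-contained version (using a simplified, integer-time subset of the question's simulated data) would look roughly like this:

```python
import numpy as np
import pandas as pd

# simplified simulated data: two observed time windows with a gap between them
sampleSize = 100
rng = np.random.default_rng(0)
data = {}
for time in np.arange(0, 5):
    data[time] = rng.random(sampleSize)
for time in np.arange(80, 120):
    data[time] = rng.random(sampleSize)

# rows = times, columns = wavelength bins
df = pd.DataFrame(data).T

# reindex onto the full integer time axis; the missing times become
# all-NaN rows, which imshow leaves blank, so the gaps stay visible
full = df.reindex(np.arange(df.index.max() + 1))
```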
In the end I used a different approach:
1) Re-index the time in my data so that no two arrays have the same time and I avoid non-integer indices:
nTimes = 1
timeIndexes=[int(float(index)) for index in data.keys()]
while len(timeIndexes) != len(set(timeIndexes)):
    nTimes += 1
    timeIndexes = [int(nTimes*float(index)) for index in data.keys()]
timeIndexesDict = {str(int(nTimes*float(index))):data[index] for index in data.keys()}
lenData2Plot = max([int(key) for key in timeIndexesDict.keys()])
2) Create an array of zeros with the same number of columns as my data and a number of rows corresponding to my maximum re-indexed time:
data2Plot = np.zeros((int(lenData2Plot)+1,sampleSize))
3) Replace the rows in the array of zeros corresponding to my re-indexed times:
for index in timeIndexesDict.keys():
    data2Plot[int(index)][:] = timeIndexesDict[str(index)]
4) Plot as I normally would plot an array with no gaps:
mplt.imshow(data2Plot,
cmap='Greys',aspect='auto',origin='lower',interpolation="none",
extent=[-50,50,0,120])
mplt.show()
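Put together, the four steps above amount to pre-allocating a zero matrix and scattering the observed rows into it. A condensed sketch, using a simplified integer-time subset of the simulated data:

```python
import numpy as np

# simplified simulated data: two observed windows with a gap between them
sampleSize = 100
rng = np.random.default_rng(0)
data = {}
for time in np.arange(0, 5):
    data[time] = rng.random(sampleSize)
for time in np.arange(80, 120):
    data[time] = rng.random(sampleSize)

# blank canvas: one row per integer time step up to the last observation
maxTime = max(int(t) for t in data)
data2Plot = np.zeros((maxTime + 1, sampleSize))

# drop each observed spectrum into its row; unobserved times stay zero
for t, spectrum in data.items():
    data2Plot[int(t)] = spectrum
```

Unlike the reindex approach, the gaps here render as zeros (black or white depending on the colormap) rather than as transparent NaN rows.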
I have converted a continuous dataset to a categorical one. I get NaN values after conversion whenever the value of the continuous data is 0.0. Below is my code:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('NSL-KDD/KDDTrain+.txt',header=None)
data = df[33]
bins = [0.000,0.05,0.10,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50,0.55,0.60,0.65,0.70,0.75,0.80,0.85,0.90,0.95,1.00]
category = pd.cut(data,bins)
category = category.to_frame()
print (category)
How do I convert the values so that I don't get NaN values? I have attached two screenshots for a better understanding of how the actual data looks and how the converted data looks. This is the main dataset. This is what it becomes after using bins and pandas.cut(). How can those "0.00" values stay like the other values in the dataset?
When using pd.cut, you can specify the parameter include_lowest=True. This makes the first interval left-inclusive (it will include the 0 value, as your first interval starts at 0).
So in your case, you can adjust your code to:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('NSL-KDD/KDDTrain+.txt',header=None)
data = df[33]
bins = [0.000,0.05,0.10,0.15,0.20,0.25,0.30,0.35,0.40,0.45,0.50,0.55,0.60,0.65,0.70,0.75,0.80,0.85,0.90,0.95,1.00]
category = pd.cut(data,bins,include_lowest=True)
category = category.to_frame()
print (category)
Documentation Reference for pd.cut
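A tiny demonstration of the difference, on made-up values:

```python
import pandas as pd

s = pd.Series([0.00, 0.03, 0.50, 1.00])
bins = [0.00, 0.05, 0.50, 1.00]

# default: the first interval is (0.0, 0.05], open on the left,
# so an exact 0.0 falls outside every bin and becomes NaN
default_cut = pd.cut(s, bins)

# include_lowest=True closes the first interval on the left,
# so 0.0 lands in the first bin
inclusive_cut = pd.cut(s, bins, include_lowest=True)
```

Without the flag, exact 0.0 values fall outside the first (0.0, 0.05] interval, which is exactly the NaN behavior in the question.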
I am trying to plot a table in Python. I have it working, but when I plot my table, it doesn't plot the Pass/Fail column at the end as I have it written. It seems that the columns are being displayed in alphabetical order.
How do I disable this?
I also want to add the last column as just one row (basically a big check box), but when I do that it gives me an error that the arrays must be the same length, which makes sense. How can I get around this and just have one large column with no rows?
import pandas as pd
import matplotlib.pyplot as plt
MinP_M=5
Min_M=6
Per_M=7
Per_G=8
Per2_M=9
PerFlat_M=10
MaxPL_M=11
Max_M=12
GF_M =13
fig1 = plt.figure()
fig1.set_size_inches(8.7,11.75,forward=True)
ax1=fig1.add_subplot(111)
ax1.axis('off')
ax1.axis('tight')
data2={'Min':['%s'%MinP_M,'%s'%Min_M,'',''],
'Typ':['%s'%Per_M,'%s'%Per_G,'%s'%Per2_M,'+/- %s'%PerFlat_M],
'Max':['%s'%MaxPL_M,'','%s'%Max_M,'+/- %s'%GF_M],
'Pass/Fail':['','','','']
}
df2 = pd.DataFrame(data2)
the_table2=ax1.table(cellText=df2.values,colWidths=[0.15]*5,rowLabels=['A','B','C', 'D'],colLabels=df2.columns,loc='center')
plt.show()
The first part is relatively easy to solve. As you create your pandas DataFrame using a dict, the order of keywords, and thus the order of columns, is not fixed. To get the ordering correct, use the columns keyword. The second part was a bit more tricky. The solution I found is to overlay your original table with a second table, and then add another cell to that second table with the same height as the four cells of the original table. For that you first have to obtain the cell dictionary from the table instance and sum up the heights of the table rows. Please see the code below:
import pandas as pd
import matplotlib.pyplot as plt
MinP_M=5
Min_M=6
Per_M=7
Per_G=8
Per2_M=9
PerFlat_M=10
MaxPL_M=11
Max_M=12
GF_M =13
fig1 = plt.figure()
##this line entirely messed up the plot for me (on Mac):
##fig1.set_size_inches(8.7,11.75,forward=True)
ax1=fig1.add_subplot(111)
ax1.axis('off')
ax1.axis('tight')
data2={'Min':['%s'%MinP_M,'%s'%Min_M,'',''],
'Typ':['%s'%Per_M,'%s'%Per_G,'%s'%Per2_M,'+/- %s'%PerFlat_M],
'Max':['%s'%MaxPL_M,'','%s'%Max_M,'+/- %s'%GF_M],
'Pass/Fail':['','','','']
}
##fix the column ordering with a list:
keys = ['Min', 'Typ', 'Max', 'Pass/Fail']
df2 = pd.DataFrame(data2, columns=keys)
##defining the size of the table cells
row_label_width = 0.05
col_width = 0.15
col_height = 0.05
the_table2=ax1.table(
cellText=df2.values,
colWidths=[col_width]*4,
rowLabels=['A','B','C', 'D'],
colLabels=df2.columns,
##loc='center', ##this has no effect if the bbox keyword is used
bbox = [0,0,col_width*4,col_height*5],
)
celld = the_table2.get_celld()
##getting the heights of the header and the columns:
row_height_tot = 0
for (i,j),cell in celld.items():
    if j==3 and i>0: #last column, but not the header
        row_height_tot += cell.get_height()
the_table3=ax1.table(
cellText=['0'], ##cannot be empty
colLabels=df2.columns[-1:],
colWidths=[col_width],
bbox = [col_width*3,0,col_width,col_height*5],
)
the_table3.add_cell(1,0,col_width,row_height_tot)
fig1.tight_layout()
plt.show()
I had to turn off some of your formatting options as they gave weird results on my computer. If you want to have the table centred, play with the bbox options in the table commands. The final result looks like this:
Hope this helps.