I'm testing a light sensor for sensitivity. I now have data that I would like to plot.
The sensor has 24 levels of sensitivity.
I'm only testing levels 0, 6, 12, 18 and 23.
On the x-axis: PWM value, range 0-65000.
My goal is to plot from a dataframe using Plotly.
My question is:
How can I combine the data (as shown below) into a dataframe for plotting?
EDIT: The link to my csv files: https://filetransfer.io/data-package/QwzFzT8O
Also below: my code so far.
Thanks!
import plotly.express as px
import pandas as pd

def main_code():
    data = pd.DataFrame(columns=['PWM', 'sens_00', 'sens_06', 'sens_12', 'sens_18', 'sens_23'])
    sens_00 = pd.read_csv('sens_00.csv', sep=';')
    sens_06 = pd.read_csv('sens_06.csv', sep=';')
    sens_12 = pd.read_csv('sens_12.csv', sep=';')
    sens_18 = pd.read_csv('sens_18.csv', sep=';')
    sens_23 = pd.read_csv('sens_23.csv', sep=';')
    print(data)
    print(sens_23)

if __name__ == '__main__':
    main_code()
@Dawid's answer works, but it does not produce a single complete dataframe (useful if you want to do more than just plotting), and it contains a lot of redundancy.
Below is a better way to concatenate the multiple csv files.
Then plotting is just a single call.
Reading the csv files into a single dataframe:
from pathlib import Path
import pandas as pd
def read_dataframes(data_root: Path):
    # This could be collapsed into a single expression,
    # but it is kept more readable here
    dataframes = []
    for fpath in data_root.glob("*.csv"):
        df = pd.read_csv(fpath, sep=";")
        df = df[["pwm", "lux"]]
        df = df.rename({"lux": fpath.stem}, axis="columns")
        df = df.set_index("pwm")
        dataframes.append(df)
    return pd.concat(dataframes)

data_root = Path("data")
df = read_dataframes(data_root)
df
sens_06 sens_18 sens_12 sens_23 sens_00
pwm
100 0.00000 NaN NaN NaN NaN
200 1.36435 NaN NaN NaN NaN
300 6.06451 NaN NaN NaN NaN
400 12.60010 NaN NaN NaN NaN
500 20.03770 NaN NaN NaN NaN
... ... ... ... ... ...
64700 NaN NaN NaN NaN 5276.74
64800 NaN NaN NaN NaN 5282.29
64900 NaN NaN NaN NaN 5290.45
65000 NaN NaN NaN NaN 5296.63
65000 NaN NaN NaN NaN 5296.57
[2098 rows x 5 columns]
Plotting:
df.plot(backend="plotly") # equivalent to px.line(df)
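An alternative, if you prefer calling Plotly Express directly: reshape into long format (one row per measurement) and let px.line split the lines by a column. A minimal sketch, using made-up stand-ins for two of the csv files (same pwm/lux layout as above):

```python
import pandas as pd

# Hypothetical stand-ins for two of the csv files
sens_00 = pd.DataFrame({"pwm": [100, 200], "lux": [0.0, 1.4]})
sens_23 = pd.DataFrame({"pwm": [100, 200], "lux": [0.1, 2.0]})

frames = []
for name, frame in [("sens_00", sens_00), ("sens_23", sens_23)]:
    frame = frame.copy()
    frame["sensitivity"] = name  # tag each row with its source file
    frames.append(frame)

long_df = pd.concat(frames, ignore_index=True)
print(long_df)
# One line per sensitivity level:
# px.line(long_df, x="pwm", y="lux", color="sensitivity")
```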
Here is my suggestion. You have two columns in each file, and you need to use unique column names to keep both of them. All files are loaded and appended to an empty DataFrame called data. To get all columns into one plot, you add the remaining traces with fig.add_scatter. The code:
import pandas as pd
import plotly.express as px
def main_code():
    data = pd.DataFrame()
    for filename in ['sens_00', 'sens_06', 'sens_12', 'sens_18', 'sens_23']:
        data[['{}-PWM'.format(filename), '{}-LUX'.format(filename)]] = pd.read_csv('{}.csv'.format(filename), sep=';')
    print(data)
    fig = px.line(data_frame=data, x=data['sens_00-PWM'], y=data['sens_00-LUX'])
    for filename in ['sens_06', 'sens_12', 'sens_18', 'sens_23']:
        fig.add_scatter(x=data['{}-PWM'.format(filename)], y=data['{}-LUX'.format(filename)], mode='lines')
    fig.show()

if __name__ == '__main__':
    main_code()
Based on the suggestion by @Dawid
This is what I was going for.
I'm working with several csv files in Pandas. I changed some data names in the original csv file and saved it. Then I restarted and reloaded my Jupyter notebook, but now, for every dataframe I load from that data source, I get something like this:
Department Zone Element Product Year Unit Value
0 U1,"Z3","ODD 2.a.1... NaN NaN NaN NaN NaN NaN
1 U1,"Z3","ODD 2.a.1... NaN NaN NaN NaN NaN NaN
2 U1,"Z5","ODD 2.a.1... NaN NaN NaN NaN NaN NaN
3 U1,"Z6","ODD 2.a.1... NaN NaN NaN NaN NaN NaN
4 U1,"Z9","ODD 2.a.1... NaN NaN NaN NaN NaN NaN
I tried to use sep=',', encoding='UTF-8-SIG', quotechar='"', quoting=0, engine='python', but I get the same issue. I don't know how to parse the csv, because even when I created a new csv from the data (without the quotes and with ; as separator) the same issue appears...
The csv is 321 rows, like this example with the problem: https://www.cjoint.com/c/LDCmfvq06R6
and the original csv file without the problem in Pandas: https://www.cjoint.com/c/LDCmlweuR66
I think the problem is with the quotes in the file.
import csv
import pandas as pd

df = pd.read_csv('LDCmfvq06R6_FAOSTAT.csv', quotechar='"',
                 delimiter=',',
                 quoting=csv.QUOTE_NONE,
                 on_bad_lines='skip')
# strip the leftover quote characters from every column
for i, col in enumerate(df.columns):
    df.iloc[:, i] = df.iloc[:, i].str.replace('"', '')
df.head()
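A self-contained illustration of the same idea, using an inline string with made-up values instead of the downloaded file:

```python
import csv
import io
import pandas as pd

# Inline stand-in for the broken file: every field is wrapped in quotes
raw = 'Department,Zone,Element\n"U1","Z3","ODD"\n"U1","Z5","ODD"\n'

# QUOTE_NONE keeps the quote characters as literal text ...
df = pd.read_csv(io.StringIO(raw), delimiter=',', quoting=csv.QUOTE_NONE)

# ... so they are stripped manually afterwards
for i, col in enumerate(df.columns):
    df.iloc[:, i] = df.iloc[:, i].str.replace('"', '')

print(df)
```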
Good afternoon
I am trying to import more than 100 separate .txt files containing data I want to plot. I would like to automate this process, since doing the same iteration for every individual file is most tedious.
I have read up on how to read multiple .txt files and found a nice explanation. However, following the example, all my data gets imported as NaNs. I read up some more and found a more reliable way of importing .txt files, namely pd.read_fwf(), as can be seen here.
Although I can at least see my data now, I have no clue how to plot it, since the data is in one column separated by \t, e.g.
0 Extension (mm)\tLoad (kN)\tMachine extension (mm)\tPreload extension
1 0.000000\t\t\t
2 0.152645\t0.000059312\t.....
... etc.
I have tried using different separators in both pd.read_csv() and pd.read_fwf(), including ' ', '\t' and '\s+', but to no avail.
Of course this causes a problem, because now I cannot plot my data. Speaking of which, I am also not sure how to plot the data in the dataframe. I want to plot each .txt file's data separately on the same scatter plot.
I am very new to Stack Overflow, so pardon the format of the question if it does not conform to the normal standard. I attach my code below, but unfortunately I cannot attach my .txt files. Each .txt file contains about a thousand rows of data. I attach a picture of the general format of all the files: General format of the .txt files.
import numpy as np
import pandas as pd
from matplotlib import pyplot as pp
import os
import glob
# change the working directory
os.chdir(r"C:\Users\Philip de Bruin\Desktop\Universiteit van Pretoria\Nagraads\sterktetoetse_basislyn\trektoetse\speel")
# get the file names
leggername = [i for i in glob.glob("*.txt")]
# read every file into its own dataframe (this gives a list of dataframes)
df = [pd.read_fwf(legger) for legger in leggername]
df
EDIT: the output I get now for the DataFrame is:
[ Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension
0
1 0.000000\t\t\t
2
3 0.152645\t0.000059312\t-...
4
... ...
997 76.0173\t0.037706\t0.005...
998
999 76.1699\t0.037709\t\t
1000
1001
from Preload (mm)
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
... ... ... ...
997 NaN NaN NaN
998 NaN NaN NaN
999 NaN NaN NaN
1000 NaN NaN NaN
1001 NaN NaN NaN
[1002 rows x 4 columns],
Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension
0
1 0.000000\t\t\t
2
3 0.128151\t0.000043125\t-...
4
... ...
997 63.8191\t0.034977\t-0.00...
998
999 63.9473\t0.034974\t\t
1000
1001
from Preload (mm)
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
... ... ... ...
997 NaN NaN NaN
998 NaN NaN NaN
999 NaN NaN NaN
1000 NaN NaN NaN
1001 NaN NaN NaN
[1002 rows x 4 columns],
Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension
0
1 0.000000\t\t\t
2
3 0.174403\t0.000061553\t0...
4
... ...
997 86.8529\t0.036093\t-0.00...
998
999 87.0273\t\t-0.0059160\t-...
1000
1001
from Preload (mm)
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
... ... ... ...
997 NaN NaN NaN
998 NaN NaN NaN
999 NaN NaN NaN
1000 NaN NaN NaN
1001 NaN NaN NaN
... etc
The basic gist is to skip the first data row (that has a single value in it), then read the individual files with pd.read_csv, using tab as the separator, and stack them together.
There is, however, a more problematic issue: the data files turn out to be UTF-16 encoded (the binary data show a NUL character at the even positions), but there is no byte-order-mark (BOM) to indicate this. As a result, you can't specify the encoding in read_csv, but have to manually read each file as binary, then decode it with UTF-16 to a string, then feed that string to read_csv. Since the latter requires a filename or IO-stream, the text data needs to be put into a StringIO object first (or save the corrected data to disk first, then read the corrected file; might not be a bad idea).
import pandas as pd
import os
import glob
import io
# change the working directory
os.chdir(r"C:\Users\Philip de Bruin\Desktop\Universiteit van Pretoria\Nagraads\sterktetoetse_basislyn\trektoetse\speel")
dfs = []
for filename in glob.glob("*.txt"):
    with open(filename, 'rb') as fp:
        data = fp.read()  # a single file should fit in memory just fine
    # Decode the UTF-16 data that is missing a BOM
    string = data.decode('UTF-16')
    # And put it into a stream, for ease of use with `read_csv`
    stream = io.StringIO(string)
    # Read the data from the, now properly decoded, stream.
    # Skip the single-value row, and use tabs as separators
    df = pd.read_csv(stream, sep='\t', skiprows=[1])
    # To keep track of the individual files, add an "origin" column
    # with its value set to the corresponding filename
    df['origin'] = filename
    dfs.append(df)

# Concatenate all dataframes (default is to stack the rows)
df = pd.concat(dfs)
# For a quick and dirty plot, you can enjoy the power of Seaborn
import seaborn as sns
# Use appropriate (full) column names, and use the 'origin'
# column for the hue and symbol
sns.scatterplot(data=df, x='Time (s)', y='Machine Extension (mm)', hue='origin', style='origin')
Seaborn's scatterplot documentation.
I have some long-winded code here with an issue when I am attempting to join (or merge/concat) two datasets together; I get this TypeError: Cannot compare type 'Timestamp' with type 'int'
The two datasets both come from resampling the same initial starting dataset. The master_hrs df is produced by a change point detection process using the Python package ruptures (pip install ruptures to run the code). The daily_summary df just uses Pandas to resample daily mean & sum values. But I get the error when I attempt to combine the datasets together. Would anyone have any tips to try?
Making up some fake data generates the same error as my real-world dataset. I think the issue is that I am somehow trying to compare a datetime to a numpy integer... Any tips greatly appreciated. Thanks
import ruptures as rpt
import calendar
import numpy as np
import pandas as pd
np.random.seed(11)
rows,cols = 50000,2
data = np.random.rand(rows,cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='H')
df = pd.DataFrame(data, columns=['Temperature','Value'], index=tidx)
def changPointDf(df):
    arr = np.array(df.Value)

    # Define Binary Segmentation search method
    model = "l2"
    algo = rpt.Binseg(model=model).fit(arr)
    my_bkps = algo.predict(n_bkps=5)

    # getting the timestamps of the change points
    bkps_timestamps = df.iloc[[0] + my_bkps[:-1] + [-1]].index

    # computing the durations between change points
    durations = (bkps_timestamps[1:] - bkps_timestamps[:-1])

    # hours calc
    d = durations.seconds / 60 / 60
    d_f = pd.DataFrame(d)
    df2 = d_f.T
    return df2

master_hrs = pd.DataFrame()
for idx, days in df.groupby(df.index.date):
    changPoint_df = changPointDf(days)
    values = changPoint_df.values.tolist()
    master_hrs = master_hrs.append(values)
master_hrs.columns = ['overnight_AM_hrs', 'moring_startup_hrs', 'moring_ramp_hrs', 'high_load_hrs', 'evening_shoulder_hrs']
daily_summary = pd.DataFrame()
daily_summary['Temperature'] = df['Temperature'].resample('D').mean()
daily_summary['Value'] = df['Value'].resample('D').sum()
final_df = daily_summary.join(master_hrs)
The issue was the indexes themselves - master_hrs was int64 whereas daily_summary was datetime. Include this before joining the two dataframes together:
master_hrs.index = pd.to_datetime(master_hrs.index)
Just for clarity, here's my output of final_df:
Temperature Value ... high_load_hrs evening_shoulder_hrs
2019-01-01 0.417517 12.154527 ... NaN NaN
2019-01-02 0.521131 13.811842 ... NaN NaN
2019-01-03 0.583205 12.568966 ... NaN NaN
2019-01-04 0.448225 14.036136 ... NaN NaN
2019-01-05 0.542870 10.738192 ... NaN NaN
... ... ... ... ...
2024-09-10 0.470421 13.775528 ... NaN NaN
2024-09-11 0.384672 10.473930 ... NaN NaN
2024-09-12 0.527284 14.000231 ... NaN NaN
2024-09-13 0.555646 11.460867 ... NaN NaN
2024-09-14 0.426003 3.763975 ... NaN NaN
[2084 rows x 7 columns]
Hopefully this gets you what you need.
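A minimal sketch of the same fix, using tiny made-up frames instead of the ruptures output: one frame carries a DatetimeIndex (as resample('D') produces), the other ended up with plain strings, and converting with pd.to_datetime makes the join line up.

```python
import pandas as pd

# daily_summary-style frame: DatetimeIndex from resampling
daily = pd.DataFrame({"Value": [1.0, 2.0]},
                     index=pd.date_range("2019-01-01", periods=2, freq="D"))

# master_hrs-style frame: the index is not datetime-typed
master = pd.DataFrame({"high_load_hrs": [5, 7]},
                      index=["2019-01-01", "2019-01-02"])

# Without conversion the index types differ and nothing aligns;
# converting to datetime makes the join match row for row
master.index = pd.to_datetime(master.index)
final = daily.join(master)
print(final)
```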
I have a dataframe similar to the one below.
I want to remove text and keep only the digits in each column of that dataframe.
The expected output is something like this.
So far I have tried this:
import json
import requests
import pandas as pd

URL = 'https://xxxxx.com'
req = requests.get(URL, auth=('xxx', 'xxx'))
text_data = req.text
json_dict = json.loads(text_data)
df = pd.DataFrame.from_dict(json_dict["measurements"])
cols_to_keep = ['source', 'battery', 'c8y_TemperatureMeasurement', 'time', 'c8y_DistanceMeasurement']
df_final = df[cols_to_keep]
df_final = df_final.rename(columns={'c8y_TemperatureMeasurement': 'Temperature Or T', 'c8y_DistanceMeasurement': 'Distance'})
for col in df_final:
    df_final[col] = [''.join(re.findall(r"\d*\.?\d+", item)) for item in df_final[col]]
Your code is missing import re, and the data cannot be accessed because it requires credentials.
You can use pandas.DataFrame.replace:
Example data:
df = pd.DataFrame({'a':['abc123abc', 'def456678'], 'b':['123a', 'b456']})
Dataframe:
a b
0 abc123abc 123a
1 def456678 b456
[^0-9.] replaces all non-digit characters.
df.replace('[^0-9.]', '', regex=True)
Output:
a b
0 123 123
1 456678 456
Edit:
The problem here is actually about nested JSON, not about replacing values in a dataframe. The statement above does not work because the data is stored as dicts in the dataframe. But since the solution above is generally correct, I won't edit it out.
Revised answer:
import json
import requests
import pandas as pd

URL = 'https://wastemanagement.post-iot.lu/measurement/measurements?source=83512& pageSize=1000000000&dateFrom=2019-10-26&dateTo=2019-10-28'
req = requests.get(URL, auth=('xxxx', 'xxxx'))
text_data = req.text
json_dict = json.loads(text_data)
df = pd.json_normalize(json_dict['measurements'])
df = df.rename(columns={'source.id': 'source', 'battery.percent.value': 'battery', 'c8y_TemperatureMeasurement.T.value': 'Temperature Or T', 'c8y_DistanceMeasurement.distance.value': 'Distance'})
cols_to_keep = ['source', 'battery', 'Temperature Or T', 'time', 'Distance']
df_final = df[cols_to_keep]
Output:
source battery Temperature Or T time Distance
0 83512 98.0 NaN 2019-10-26T00:00:06.494Z NaN
1 83512 NaN 23.0 2019-10-26T00:00:06.538Z NaN
2 83512 NaN NaN 2019-10-26T00:00:06.577Z 21.0
3 83512 98.0 NaN 2019-10-26T00:30:06.702Z NaN
4 83512 NaN 23.0 2019-10-26T00:30:06.743Z NaN
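To see why json_normalize is the right tool here, a tiny self-contained sketch with a made-up stand-in for the API response (using the current pd.json_normalize spelling): the nested dicts become flat, dotted column names, which can then be renamed.

```python
import pandas as pd

# Tiny stand-in for the API response: values are nested dicts,
# which is why a plain df.replace() could not touch them
records = [
    {"source": {"id": "83512"}, "battery": {"percent": {"value": 98.0}}},
    {"source": {"id": "83512"}, "battery": {"percent": {"value": 97.5}}},
]

# json_normalize flattens the nesting into dotted column names
flat = pd.json_normalize(records)
print(flat.columns.tolist())  # ['source.id', 'battery.percent.value']
flat = flat.rename(columns={"source.id": "source",
                            "battery.percent.value": "battery"})
```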
I have the following data from a CFD simulation:
Average value for X = 0.5080000265E-0003 to 0.2489200234E-0001
Z = -.3141592741E+0001
Time = 0.7000032425E+0001
Y P_g
0.1511904760E-0002 0.2565604063E+0006
0.4535714164E-0002 0.2565349844E+0006
0.7559523918E-0002 0.2565098906E+0006
0.1058333274E-0001 0.2564848125E+0006
0.1360714249E-0001 0.2564597656E+0006
0.1663095318E-0001 0.2564346563E+0006
0.1965476200E-0001 0.2564095625E+0006
... ...
... ...
0.1259419441E+0001 0.2549983125E+0006
0.1262443304E+0001 0.2549983125E+0006
0.1265467167E+0001 0.2549983125E+0006
0.1268491030E+0001 0.2549982656E+0006
Time = 0.7010014057E+0001
Y P_g
0.1511904760E-0002 0.2565604063E+0006
0.4535714164E-0002 0.2565349844E+0006
0.7559523918E-0002 0.2565098906E+0006
0.1058333274E-0001 0.2564848125E+0006
... ...
... ...
0.1259419441E+0001 0.2549983125E+0006
0.1262443304E+0001 0.2549983125E+0006
0.1265467167E+0001 0.2549983125E+0006
0.1268491030E+0001 0.2549982656E+0006
Time = 0.7020006657E+0001
Y P_g
0.1511904760E-0002 0.2565604063E+0006
0.1058333274E-0001 0.2564848125E+0006
... ...
As you can see from the above example, the data is split into several vertical sections by the time step header labeled Time. In each section, Y does not change but P_g does change. To plot the data, I need the P_g in each section to be listed in the next column. For example, this is how I need to recreate the data:
Y 0.7000032425E+1 0.7020006657E+1 ...
0.1511904760E-0002 0.2565604063E+0006 0.2549982656E+0006 ...
0.4535714164E-0002 0.2565349844E+0006 0.2549982656E+0006 ...
0.7559523918E-0002 0.2565098906E+0006 0.2549982656E+0006 ...
0.1058333274E-0001 0.2564848125E+0006 0.2549982656E+0006 ...
0.1360714249E-0001 0.2564597656E+0006 0.2549982656E+0006 ...
Using Pandas, I can read the data from the text file and create a new data frame with the Y values as the index (rows) and the Time values as the columns:
import pandas as pd

# Read in data from text file
# -------------------------------------------------------------------------
# data frame from text file contents, skip first 4 rows, separate by
# variable white space, no header
df = pd.read_table('ROP_s_SD.dat', skiprows=4, sep=r'\s*', header=None)

# Time data
# -------------------------------------------------------------------------
# data frame of the rows that contain the Time string
dftime = df.loc[df.iloc[:, 0].str.contains('Time')]

t = dftime[2].tolist()   # time list
idx = dftime.index       # index of rows containing Time string

# Y data
# -------------------------------------------------------------------------
# grab values for y to create index for new data frame
ido = idx[0] + 2   # index of first y value
idf = idx[1]       # index of last y value

y = []                        # empty list to store y values
for i in range(ido, idf):     # iterate through first section of y values
    v = df.iloc[i, 0]         # get y value from data frame
    y.append(float(v))        # add y value to y list

# New data frame
# -------------------------------------------------------------------------
# empty data frame with y as index and t as columns
dfnew = pd.DataFrame(None, index=y, columns=t)
print('dfnew is \n', dfnew.head())
The head of the empty data frame, dfnew.head() looks like the following:
7.000032 7.010014 7.020007 7.030043 7.040020 7.050035 7.060043
0.001512 NaN NaN NaN NaN NaN NaN NaN
0.004536 NaN NaN NaN NaN NaN NaN NaN
0.007560 NaN NaN NaN NaN NaN NaN NaN
0.010583 NaN NaN NaN NaN NaN NaN NaN
0.013607 NaN NaN NaN NaN NaN NaN NaN
7.070004 7.080036 7.090022 ... 7.650011 7.660032 7.670026
0.001512 NaN NaN NaN ... NaN NaN NaN
0.004536 NaN NaN NaN ... NaN NaN NaN
0.007560 NaN NaN NaN ... NaN NaN NaN
0.010583 NaN NaN NaN ... NaN NaN NaN
0.013607 NaN NaN NaN ... NaN NaN NaN
7.680044 7.690029 7.700008 7.710012 7.720014 7.730019 7.740026
0.001512 NaN NaN NaN NaN NaN NaN NaN
0.004536 NaN NaN NaN NaN NaN NaN NaN
0.007560 NaN NaN NaN NaN NaN NaN NaN
0.010583 NaN NaN NaN NaN NaN NaN NaN
0.013607 NaN NaN NaN NaN NaN NaN NaN
[5 rows x 75 columns]
The NaN in each column should contain the P_g values from that particular Time section. How can I add the P_g values from each section to their respective column?
The text file that I am reading can be downloaded here.
It looks like you've already done most of the hard work ... the following few lines will finish unraveling your DataFrame:
# Add one more element to idx for correct indexing on the last column
idx = list(idx)
idx.append(len(df))

# Loop over the idx locations to fill the columns
for i in range(len(dfnew.columns)):
    dfnew.iloc[:, i] = df.iloc[idx[i] + 2:idx[i + 1], 1].values
The head of dfnew is now something like this for the first 3 columns:
7.000032 7.010014 7.020007
0.001512 0.2565604063E+0006 0.2565604063E+0006 0.2565604063E+0006
0.004536 0.2565349844E+0006 0.2565349844E+0006 0.2565349844E+0006
0.007560 0.2565098906E+0006 0.2565098906E+0006 0.2565098906E+0006
0.010583 0.2564848125E+0006 0.2564848125E+0006 0.2564848125E+0006
0.013607 0.2564597656E+0006 0.2564597656E+0006 0.2564597656E+0006
You have a lot of elements, so probably the best way to view the data is in 2D:
data = dfnew.astype(float).values
extent = [float(dfnew.columns[0]),
float(dfnew.columns[-1]),
float(dfnew.index[0]),
float(dfnew.index[-1])]
import matplotlib.pyplot as plt
plt.imshow(data, extent=extent, origin='lower')
plt.xlabel('Time')
plt.ylabel('Y')
BTW, it looks like all the values for P_g at each time in your example file are the same anyway ...
Two things. First, perhaps you could consider how to reduce this to a 2d spreadsheet. What columns should go into each row? I suggest each row should contain Time, Y and P_g. Perhaps that can inform your strategy for handling your funky input format.
Second, for what Y value(s) are you trying to plot P_g vs. Time? Your data has 3 variables, so you'll need to reduce to 2 dimensions in order to make a 2d plot. Do you want to plot the mean of P_g for a particular Time value? Or do you want a 3d plot, where you plot Y vs. P_g for each Time value? Assuming you adopt the row/col structure I suggested above, any of these can easily be done with pandas. Check out pandas' groupby feature. Here's more detail on that.
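For instance, with the suggested long layout (one row per Time/Y/P_g observation; a made-up miniature here), the mean of P_g per time step is a single groupby call:

```python
import pandas as pd

# Made-up miniature of the suggested long layout
df = pd.DataFrame({
    "Time": [7.00, 7.00, 7.01, 7.01],
    "Y":    [0.0015, 0.0045, 0.0015, 0.0045],
    "P_g":  [256560.0, 256534.0, 256561.0, 256535.0],
})

# Mean P_g for each time step
means = df.groupby("Time")["P_g"].mean()
print(means)
```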
EDIT: you've clarified both my questions. Try this:
import pandas
import numpy
from io import StringIO

# main dataframe
df = pandas.DataFrame(columns=['Time', 'Y', 'P_g'])

text = open('ROP_s_SD.dat', 'r').read()
chunks = text.split("Time = ")
# ignore first chunk
chunks = chunks[1:]
for chunk in chunks:
    time_str, rest_str = chunk.split('\n', 1)
    time = float(time_str)
    chunk_df = pandas.read_csv(StringIO(rest_str), sep=r'\s+', index_col=False)
    chunk_df['Time'] = time
    # add the new content to the main dataframe
    df = pandas.concat([df, chunk_df], ignore_index=True)

# you should now have a DataFrame with columns 'Time', 'Y', 'P_g'
assert sorted(df.columns) == ['P_g', 'Time', 'Y']

# iterate over unique values of time
times = sorted(set(df['Time']))
assert len(times) == len(chunks)
for i, time in enumerate(times):
    chunk_data = df[df['Time'] == time]
    # plot or do whatever you'd like with each segment
    means = numpy.mean(chunk_data)
    stds = numpy.std(chunk_data)
    print('Data for time %d (%0.4f): ' % (i, time))
    print(means, stds)