Use Python and Pandas to split data in a text file

I have the following data from a CFD simulation:
Average value for X = 0.5080000265E-0003 to 0.2489200234E-0001
Z = -.3141592741E+0001
Time = 0.7000032425E+0001
Y P_g
0.1511904760E-0002 0.2565604063E+0006
0.4535714164E-0002 0.2565349844E+0006
0.7559523918E-0002 0.2565098906E+0006
0.1058333274E-0001 0.2564848125E+0006
0.1360714249E-0001 0.2564597656E+0006
0.1663095318E-0001 0.2564346563E+0006
0.1965476200E-0001 0.2564095625E+0006
... ...
... ...
0.1259419441E+0001 0.2549983125E+0006
0.1262443304E+0001 0.2549983125E+0006
0.1265467167E+0001 0.2549983125E+0006
0.1268491030E+0001 0.2549982656E+0006
Time = 0.7010014057E+0001
Y P_g
0.1511904760E-0002 0.2565604063E+0006
0.4535714164E-0002 0.2565349844E+0006
0.7559523918E-0002 0.2565098906E+0006
0.1058333274E-0001 0.2564848125E+0006
... ...
... ...
0.1259419441E+0001 0.2549983125E+0006
0.1262443304E+0001 0.2549983125E+0006
0.1265467167E+0001 0.2549983125E+0006
0.1268491030E+0001 0.2549982656E+0006
Time = 0.7020006657E+0001
Y P_g
0.1511904760E-0002 0.2565604063E+0006
0.1058333274E-0001 0.2564848125E+0006
... ...
As you can see from the above example, the data is split into several vertical sections by the time step header labeled Time. In each section, Y does not change but P_g does change. To plot the data, I need the P_g in each section to be listed in the next column. For example, this is how I need to recreate the data:
Y 0.7000032425E+1 0.7020006657E+1 ...
0.1511904760E-0002 0.2565604063E+0006 0.2549982656E+0006 ...
0.4535714164E-0002 0.2565349844E+0006 0.2549982656E+0006 ...
0.7559523918E-0002 0.2565098906E+0006 0.2549982656E+0006 ...
0.1058333274E-0001 0.2564848125E+0006 0.2549982656E+0006 ...
0.1360714249E-0001 0.2564597656E+0006 0.2549982656E+0006 ...
Using Pandas, I can read the data from the text file and create a new data frame with the Y values as the index (rows) and the Time values as the columns:
import pandas as pd

# Read in data from text file
# -------------------------------------------------------------------------
# data frame from text file contents, skip first 4 rows, separate by
# variable white space, no header
df = pd.read_csv('ROP_s_SD.dat', skiprows=4, sep=r'\s+', header=None)

# Time data
# -------------------------------------------------------------------------
# data frame of the rows that contain the Time string
dftime = df.loc[df.iloc[:, 0].str.contains('Time')]
t = dftime[2].tolist()   # time list
idx = dftime.index       # index of rows containing Time string

# Y data
# -------------------------------------------------------------------------
# grab values for y to create index for new data frame
ido = idx[0] + 2   # index of first y value
idf = idx[1]       # index of last y value
y = []             # empty list to store y values

for i in range(ido, idf):   # iterate through first section of y values
    v = df.iloc[i, 0]       # get y value from data frame
    y.append(float(v))      # add y value to y list

# New data frame
# -------------------------------------------------------------------------
# empty data frame with y as index and t as columns
dfnew = pd.DataFrame(None, index=y, columns=t)
print('dfnew is \n', dfnew.head())
The head of the empty data frame, dfnew.head() looks like the following:
7.000032 7.010014 7.020007 7.030043 7.040020 7.050035 7.060043
0.001512 NaN NaN NaN NaN NaN NaN NaN
0.004536 NaN NaN NaN NaN NaN NaN NaN
0.007560 NaN NaN NaN NaN NaN NaN NaN
0.010583 NaN NaN NaN NaN NaN NaN NaN
0.013607 NaN NaN NaN NaN NaN NaN NaN
7.070004 7.080036 7.090022 ... 7.650011 7.660032 7.670026
0.001512 NaN NaN NaN ... NaN NaN NaN
0.004536 NaN NaN NaN ... NaN NaN NaN
0.007560 NaN NaN NaN ... NaN NaN NaN
0.010583 NaN NaN NaN ... NaN NaN NaN
0.013607 NaN NaN NaN ... NaN NaN NaN
7.680044 7.690029 7.700008 7.710012 7.720014 7.730019 7.740026
0.001512 NaN NaN NaN NaN NaN NaN NaN
0.004536 NaN NaN NaN NaN NaN NaN NaN
0.007560 NaN NaN NaN NaN NaN NaN NaN
0.010583 NaN NaN NaN NaN NaN NaN NaN
0.013607 NaN NaN NaN NaN NaN NaN NaN
[5 rows x 75 columns]
The NaN in each column should contain the P_g values from that particular Time section. How can I add the P_g values from each section to their respective column?
The text file that I am reading can be downloaded here.

It looks like you've already done most of the hard work ... the following few lines will finish unraveling your DataFrame:
# Add one more element to idx for correct indexing on the last column
idx = list(idx)
idx.append(len(df))

# Loop over the idx locations to fill the columns
for i in range(len(dfnew.columns)):
    dfnew.iloc[:, i] = df.iloc[idx[i]+2:idx[i+1], 1].values
The head of dfnew is now something like this for the first 3 columns:
7.000032 7.010014 7.020007
0.001512 0.2565604063E+0006 0.2565604063E+0006 0.2565604063E+0006
0.004536 0.2565349844E+0006 0.2565349844E+0006 0.2565349844E+0006
0.007560 0.2565098906E+0006 0.2565098906E+0006 0.2565098906E+0006
0.010583 0.2564848125E+0006 0.2564848125E+0006 0.2564848125E+0006
0.013607 0.2564597656E+0006 0.2564597656E+0006 0.2564597656E+0006
You have a lot of elements, so probably the best way to view the data is in 2D:
data = dfnew.astype(float).values
extent = [float(dfnew.columns[0]),
          float(dfnew.columns[-1]),
          float(dfnew.index[0]),
          float(dfnew.index[-1])]

import matplotlib.pyplot as plt
plt.imshow(data, extent=extent, origin='lower')
plt.xlabel('Time')
plt.ylabel('Y')
BTW, it looks like all the values for P_g at each time in your example file are the same anyway ...
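As an aside: once the data is in long form (one row per Time, Y, P_g observation, as the answer below builds), the whole reshape collapses to a single pivot call. A minimal sketch, where dflong is a hypothetical long-format frame:
# Hedged sketch: dflong is assumed to have columns 'Time', 'Y' and 'P_g'
# (e.g. built the way the next answer builds its main dataframe)
dfwide = dflong.pivot(index='Y', columns='Time', values='P_g')
# dfwide now has Y as the index and one column per Time value,
# the same layout as dfnew above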

Two things. First, perhaps you could consider how you can reduce this to a 2D spreadsheet. What columns should go into each row? I suggest each row should contain Time, Y and P_g. Perhaps that can inform your strategy for handling your funky input format.
Second, for what Y value(s) are you trying to plot P_g vs. Time? Your data appears to have 3 variables, so you'll need to reduce to 2 dimensions in order to make a 2D plot. Do you want to plot the mean of P_g for a particular Time value? Or do you want a 3D plot, where you plot Y vs. P_g for each Time value? Assuming you adopt the row/column structure I suggested above, any of these can be done easily with pandas; check out the pandas groupby feature.
EDIT: you've clarified both my questions. Try this:
import pandas, sys, numpy

if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

# main dataframe
df = pandas.DataFrame(columns=['Time', 'Y', 'P_g'])

text = open('ROP_s_SD.dat', 'r').read()
chunks = text.split("Time = ")

# ignore first chunk
chunks = chunks[1:]

for chunk in chunks:
    time_str, rest_str = chunk.split('\n', 1)
    time = float(time_str)
    chunk_df = pandas.read_csv(StringIO(rest_str), sep=r'\s+', index_col=False)
    chunk_df['Time'] = time
    # add new content to main dataframe
    df = pandas.concat([df, chunk_df])

# you should now have a DataFrame with columns 'Time', 'Y', 'P_g'
assert sorted(df.columns) == ['P_g', 'Time', 'Y']

# iterate over unique values of time
times = sorted(set(df['Time']))
assert len(times) == len(chunks)

for i, time in enumerate(times):
    chunk_data = df[df['Time'] == time]
    # plot or do whatever you'd like with each segment
    means = numpy.mean(chunk_data)
    stds = numpy.std(chunk_data)
    print('Data for time %d (%0.4f):' % (i, time))
    print(means)
    print(stds)
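For the per-time statistics at the end, the pandas groupby feature mentioned above can replace the manual loop over times. A brief sketch under the same assumptions (df in long form with columns Time, Y, P_g):
# Hedged sketch: per-Time mean and std of P_g via groupby,
# equivalent to the loop above without manual masking
stats = df.groupby('Time')['P_g'].agg(['mean', 'std'])
print(stats)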


How to combine sensor data for plotting

I'm testing a light sensor for sensitivity. I now have data that I would like to plot.
The sensor has 24 levels of sensitivity.
I'm only testing levels 0, 6, 12, 18 and 23.
On the x-axis: PWM value, range 0-65000.
My goal is to plot from a dataframe using plotly.
My question is:
How can I combine the data (as shown below) into a dataframe for plotting?
EDIT: The link to my csv files: https://filetransfer.io/data-package/QwzFzT8O
Also below: my code so far
Thanks!
def main_code():
    data = pd.DataFrame(columns=['PWM', 'sens_00', 'sens_06', 'sens_12', 'sens_18', 'sens_23'])
    sens_00 = pd.read_csv('sens_00.csv', sep=';')
    sens_06 = pd.read_csv('sens_06.csv', sep=';')
    sens_12 = pd.read_csv('sens_12.csv', sep=';')
    sens_18 = pd.read_csv('sens_18.csv', sep=';')
    sens_23 = pd.read_csv('sens_23.csv', sep=';')
    print(data)
    print(sens_23)

import plotly.express as px
import pandas as pd

if __name__ == '__main__':
    main_code()
@Dawid's answer is fine, but it does not produce a complete dataframe (so you can do more than just plotting), and it contains too much redundancy.
Below is a better way to concatenate the multiple csv files.
Then plotting is just a single call.
Reading csv files into a single dataframe:
from pathlib import Path
import pandas as pd

def read_dataframes(data_root: Path):
    # It could be turned into a single line,
    # but keeping it more readable here
    dataframes = []
    for fpath in data_root.glob("*.csv"):
        df = pd.read_csv(fpath, sep=";")
        df = df[["pwm", "lux"]]
        df = df.rename({"lux": fpath.stem}, axis="columns")
        df = df.set_index("pwm")
        dataframes.append(df)
    return pd.concat(dataframes)

data_root = Path("data")
df = read_dataframes(data_root)
df
sens_06 sens_18 sens_12 sens_23 sens_00
pwm
100 0.00000 NaN NaN NaN NaN
200 1.36435 NaN NaN NaN NaN
300 6.06451 NaN NaN NaN NaN
400 12.60010 NaN NaN NaN NaN
500 20.03770 NaN NaN NaN NaN
... ... ... ... ... ...
64700 NaN NaN NaN NaN 5276.74
64800 NaN NaN NaN NaN 5282.29
64900 NaN NaN NaN NaN 5290.45
65000 NaN NaN NaN NaN 5296.63
65000 NaN NaN NaN NaN 5296.57
[2098 rows x 5 columns]
Plotting:
df.plot(backend="plotly") # equivalent to px.line(df)
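If plotly should be the default for every plot call, the backend can also be set once globally instead of per call; a small sketch:
import pandas as pd

# Set plotly as the default pandas plotting backend (plotly must be installed)
pd.options.plotting.backend = "plotly"

fig = df.plot()  # now equivalent to df.plot(backend="plotly")
fig.show()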
Here is my suggestion. You have two columns in each file, and you need to use unique column names to keep both columns. All files are loaded and appended to an empty DataFrame called data. To get all columns onto one plot, you add each extra trace with fig.add_scatter. The code:
import pandas as pd
import plotly.express as px

def main_code():
    data = pd.DataFrame()
    for filename in ['sens_00', 'sens_06', 'sens_12', 'sens_18', 'sens_23']:
        data[['{}-PWM'.format(filename), '{}-LUX'.format(filename)]] = pd.read_csv('{}.csv'.format(filename), sep=';')
    print(data)
    fig = px.line(data_frame=data, x=data['sens_00-PWM'], y=data['sens_00-LUX'])
    for filename in ['sens_06', 'sens_12', 'sens_18', 'sens_23']:
        fig.add_scatter(x=data['{}-PWM'.format(filename)], y=data['{}-LUX'.format(filename)], mode='lines')
    fig.show()

if __name__ == '__main__':
    main_code()
Based on the suggestion by @Dawid.
This is what I was going for.

Automising the plot of more than a 100 .txt files using pandas, NaN problems

Good afternoon
I am trying to import more than 100 separate .txt files containing data I want to plot. I would like to automate this process, since doing the same iteration for every individual file is most tedious.
I have read up on how to read multiple .txt files and found a nice explanation. However, following the example, all my data gets imported as NaNs. I read up some more and found a more reliable way of importing .txt files, namely by using pd.read_fwf(), as can be seen here.
Although I can at least see my data now, I have no clue how to plot it, since the data is in one column separated by \t, e.g.
0 Extension (mm)\tLoad (kN)\tMachine extension (mm)\tPreload extension
1 0.000000\t\t\t
2 0.152645\t0.000059312\t.....
... etc.
I have tried using different separators in both pd.read_csv() and pd.read_fwf(), including ' ', '\t' and '\s+', but to no avail.
Of course this causes a problem, because now I cannot plot my data. Speaking of, I am also not sure how to plot the data in the dataframe. I want to plot each .txt file's data separately on the same scatter plot.
I am very new to Stack Overflow, so pardon the format of the question if it does not conform to the normal standard. I attach my code below, but unfortunately I cannot attach my .txt files. Each .txt file contains about a thousand rows of data. I attach a picture of the general format of the .txt files.
import numpy as np
import pandas as pd
from matplotlib import pyplot as pp
import os
import glob
# change the working directory
os.chdir(r"C:\Users\Philip de Bruin\Desktop\Universiteit van Pretoria\Nagraads\sterktetoetse_basislyn\trektoetse\speel")
# get the file names
leggername = [i for i in glob.glob("*.txt")]
# put everything in a dataframe
df = [pd.read_fwf(legger) for legger in leggername]
df
EDIT: the output I get now for the DataFrame is:
[ Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension
0
1 0.000000\t\t\t
2
3 0.152645\t0.000059312\t-...
4
... ...
997 76.0173\t0.037706\t0.005...
998
999 76.1699\t0.037709\t\t
1000
1001
from Preload (mm)
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
... ... ... ...
997 NaN NaN NaN
998 NaN NaN NaN
999 NaN NaN NaN
1000 NaN NaN NaN
1001 NaN NaN NaN
[1002 rows x 4 columns],
Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension
0
1 0.000000\t\t\t
2
3 0.128151\t0.000043125\t-...
4
... ...
997 63.8191\t0.034977\t-0.00...
998
999 63.9473\t0.034974\t\t
1000
1001
from Preload (mm)
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
... ... ... ...
997 NaN NaN NaN
998 NaN NaN NaN
999 NaN NaN NaN
1000 NaN NaN NaN
1001 NaN NaN NaN
[1002 rows x 4 columns],
Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension
0
1 0.000000\t\t\t
2
3 0.174403\t0.000061553\t0...
4
... ...
997 86.8529\t0.036093\t-0.00...
998
999 87.0273\t\t-0.0059160\t-...
1000
1001
from Preload (mm)
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
... ... ... ...
997 NaN NaN NaN
998 NaN NaN NaN
999 NaN NaN NaN
1000 NaN NaN NaN
1001 NaN NaN NaN
... etc
The basic gist is to skip the first data row (that has a single value in it), then read the individual files with pd.read_csv, using tab as the separator, and stack them together.
There is, however, a more problematic issue: the data files turn out to be UTF-16 encoded (the binary data show a NUL character at the even positions), but there is no byte-order-mark (BOM) to indicate this. As a result, you can't specify the encoding in read_csv, but have to manually read each file as binary, then decode it with UTF-16 to a string, then feed that string to read_csv. Since the latter requires a filename or IO-stream, the text data needs to be put into a StringIO object first (or save the corrected data to disk first, then read the corrected file; might not be a bad idea).
import pandas as pd
import os
import glob
import io

# change the working directory
os.chdir(r"C:\Users\Philip de Bruin\Desktop\Universiteit van Pretoria\Nagraads\sterktetoetse_basislyn\trektoetse\speel")

dfs = []
for filename in glob.glob("*.txt"):
    with open(filename, 'rb') as fp:
        data = fp.read()  # a single file should fit in memory just fine

    # Decode the UTF-16 data that is missing a BOM
    string = data.decode('UTF-16')

    # And put it into a stream, for ease-of-use with `read_csv`
    stream = io.StringIO(string)

    # Read the data from the, now properly decoded, stream
    # Skip the single-value row, and use tabs as separators
    df = pd.read_csv(stream, sep='\t', skiprows=[1])

    # To keep track of the individual files, add an "origin" column
    # with its value set to the corresponding filename
    df['origin'] = filename

    dfs.append(df)

# Concatenate all dataframes (default is to stack the rows)
df = pd.concat(dfs)

# For a quick and dirty plot, you can enjoy the power of Seaborn
import seaborn as sns

# Use appropriate (full) column names, and use the 'origin'
# column for the hue and symbol
sns.scatterplot(data=df, x='Time (s)', y='Machine Extension (mm)', hue='origin', style='origin')
See Seaborn's scatterplot documentation for more details.
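The parenthetical above also mentions saving the corrected data to disk first; a minimal sketch of that variant (the .utf8.csv output name is just an illustrative choice):
import glob

# Hedged sketch: decode each UTF-16 file once and re-save it as UTF-8,
# so later runs can call pd.read_csv(..., sep='\t') on the copies directly
for filename in glob.glob("*.txt"):
    with open(filename, 'rb') as fp:
        text = fp.read().decode('utf-16')
    with open(filename.replace('.txt', '.utf8.csv'), 'w', encoding='utf-8') as fp:
        fp.write(text)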

Pandas, how to calculate delta between one cell and another in different rows

I have the following frame:
USERID, EVENT1TIME, EVENT2TIME, MISC1, MISC2
123,45,,,
123,,46,,
123,,47,,
123,,48,,
123,,49,,
123,,51,,
124,45,,,
124,,46,,
124,,47,,
124,,48,,
124,,49,,
124,,51,,
I'd like to add a DELTA column that is (EVENT2TIME - EVENT1TIME):
USERID, EVENT1TIME, EVENT2TIME, MISC1, MISC2, DELTA
123,45,,,,
123,,46,,,1
123,,47,,,2
123,,48,,,3
123,,49,,,4
123,,51,,,6
124,45,,,,
124,,46,,,1
124,,47,,,2
124,,48,,,3
124,,49,,,4
124,,51,,,6
I think the first thing to do is to copy the value from the row where EVENT1TIME is populated into the other instances of that USERID. But I suspect there may be a better way.
I am making some assumptions:
You want to calculate the difference between column EVENT2TIME and the first-row value of EVENT1TIME
You want to store the results in DELTA
You can do this as follows:
import pandas as pd

df = pd.read_csv('abc.txt')
print(df)

df['DELTA'] = df.iloc[:, 2] - df.iloc[0, 1]
print(df)
The output of this will be:
USERID EVENT1TIME EVENT2TIME MISC1 MISC2 DELTA
0 123 45.0 NaN NaN NaN NaN
1 123 NaN 46.0 NaN NaN 1.0
2 123 NaN 47.0 NaN NaN 2.0
3 123 NaN 48.0 NaN NaN 3.0
4 123 NaN 49.0 NaN NaN 4.0
5 123 NaN 51.0 NaN NaN 6.0
If you know EVENT1TIME is always and only in the first row, just store it as a variable and subtract it.
val = df.EVENT1TIME[0]
df['DELTA'] = df.EVENT2TIME - val
If you have multiple values every so often in EVENT1TIME, use some logic to back or forward fill all the empty rows for EVENT1TIME. This fill is not stored in the final output df.
df['DELTA'] = df.EVENT2TIME - df.EVENT1TIME.ffill() # forward fill (down) all nan values
# OR
df['DELTA'] = df.EVENT2TIME - df.EVENT1TIME.bfill() # back fill (up) all nan values
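Since the sample data contains more than one USERID, note that a plain ffill would carry the last EVENT1TIME of one user into the next user's rows. If that matters, a hedged variant that fills within each user only:
# Hedged sketch: forward-fill EVENT1TIME per USERID group, so one user's
# start time never bleeds into the next user's rows
df['DELTA'] = df.EVENT2TIME - df.groupby('USERID')['EVENT1TIME'].ffill()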
EDIT: Keeping this for continuity despite how hacky it is.
import numpy as np

locations = list(df[~np.isnan(df.EVENT1TIME)].index)
vals = df.EVENT1TIME.loc[locations]            # all EVENT1TIME values
locations.append(df.EVENT1TIME.index[-1] + 1)  # last row index + 1

last_loc = locations[0]
for idx, next_loc in enumerate(locations[1:]):
    temp = df.loc[last_loc:next_loc-1]
    df['DELTA'].loc[last_loc:next_loc-1] = temp.EVENT2TIME - vals[last_loc]
    last_loc = next_loc

How to load a text file of data with many commented rows into pandas?

I am trying to read a delimited text file into a dataframe in Python. The delimiter is not being identified when I use pd.read_table. If I explicitly set sep = ' ', I get an error: Error tokenizing data. C error. Notably, the defaults work when I use np.loadtxt().
Example:
pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
              comment='%',
              header=None)
0
0 1850 1 -0.777 0.412 NaN NaN...
1 1850 2 -0.239 0.458 NaN NaN...
2 1850 3 -0.426 0.447 NaN NaN...
3 1850 4 -0.680 0.367 NaN NaN...
4 1850 5 -0.687 0.298 NaN NaN...
If I set sep = ' ', I get another error:
pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
              comment='%',
              header=None,
              sep=' ')
ParserError: Error tokenizing data. C error: Expected 2 fields in line 78, saw 58
Looking up this error, people suggest using header=None (already done) and setting sep explicitly, but that is what's causing the problem: Python Pandas Error tokenizing data. I looked up line 78 and can't see any problems. If I set error_bad_lines=False, I get an empty df, suggesting there is a problem with every entry.
Notably this works when I use np.loadtxt():
pd.DataFrame(np.loadtxt('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
                        comments='%'))
0 1 2 3 4 5 6 7 8 9 10 11
0 1850.0 1.0 -0.777 0.412 NaN NaN NaN NaN NaN NaN NaN NaN
1 1850.0 2.0 -0.239 0.458 NaN NaN NaN NaN NaN NaN NaN NaN
2 1850.0 3.0 -0.426 0.447 NaN NaN NaN NaN NaN NaN NaN NaN
3 1850.0 4.0 -0.680 0.367 NaN NaN NaN NaN NaN NaN NaN NaN
4 1850.0 5.0 -0.687 0.298 NaN NaN NaN NaN NaN NaN NaN NaN
This suggests to me that there isn't something wrong with the file, but rather with how I am calling pd.read_table(). I looked through the documentation for np.loadtxt() in the hope of setting the sep to the same value, but that just shows: delimiter=None (https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html).
I'd prefer to be able to import this as a pd.DataFrame, setting the names, rather than having to import as a matrix and then convert to pd.DataFrame.
What am I getting wrong?
This one is quite tricky. Please try out the code snippet below:
import pandas as pd

url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
df = pd.read_csv(url,
                 sep=r'\s+',
                 comment='%',
                 usecols=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),  # all 12 columns
                 names=('Year', 'Month', 'M.Anomaly', 'M.Unc.', 'A.Anomaly',
                        'A.Unc.', '5y.Anomaly', '5y.Unc.', '10y.Anomaly', '10y.Unc.',
                        '20y.Anomaly', '20y.Unc.'))
The issue is the file has 77 rows of commented text, for 'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Air Temperatures'
Two of the rows are headers
There's a bunch of data, then there are two more headers, and a new set of data for 'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Water Temperatures'
This solution separates the two tables in the file into separate dataframes.
This is not as nice as the other answer, but the data is properly separated into different dataframes.
The headers were a pain, it would probably be easier to manually create a custom header, and skip the lines of code for separating the headers from the text.
The important point is separating the air and water data.
import requests
import pandas as pd
import math
# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text
# convert data into a list
data = [d.strip().replace('% ', '') for d in data.split('\n')]
# specify the data from the ranges in the file
air_header1 = data[74].split() # not used
air_header2 = [v.strip() for v in data[75].split(',')]
# combine the 2 parts of the header into a single header
air_header = air_header2[:2] + [f'{air_header1[math.floor(i/2)]}_{v}' for i, v in enumerate(air_header2[2:])]
air_data = [v.split() for v in data[77:2125]]
h2o_header1 = data[2129].split() # not used
h2o_header2 = [v.strip() for v in data[2130].split(',')]
# combine the 2 parts of the header into a single header
h2o_header = h2o_header2[:2] + [f'{h2o_header1[math.floor(i/2)]}_{v}' for i, v in enumerate(h2o_header2[2:])]
h2o_data = [v.split() for v in data[2132:4180]]
# create the dataframes
air = pd.DataFrame(air_data, columns=air_header)
h2o = pd.DataFrame(h2o_data, columns=h2o_header)
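One caveat: because the rows were built by splitting strings, every column of air and h2o ends up with object (string) dtype. A short, hedged conversion step before plotting or aggregating:
# Hedged sketch: convert all string columns to numeric dtypes
# ('NaN' tokens become real NaN values)
air = air.apply(pd.to_numeric)
h2o = h2o.apply(pd.to_numeric)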
Without the header code
Simplify the code, by using a manual header list.
import pandas as pd
import requests
# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text
# convert data into a list
data = [d.strip().replace('% ', '') for d in data.split('\n')]
# manually created header
headers = ['Year', 'Month', 'Monthly_Anomaly', 'Monthly_Unc.',
           'Annual_Anomaly', 'Annual_Unc.',
           'Five-year_Anomaly', 'Five-year_Unc.',
           'Ten-year_Anomaly', 'Ten-year_Unc.',
           'Twenty-year_Anomaly', 'Twenty-year_Unc.']
# separate the air and h2o data
air_data = [v.split() for v in data[77:2125]]
h2o_data = [v.split() for v in data[2132:4180]]
# create the dataframes
air = pd.DataFrame(air_data, columns=headers)
h2o = pd.DataFrame(h2o_data, columns=headers)
air
Year Month Monthly_Anomaly Monthly_Unc. Annual_Anomaly Annual_Unc. Five-year_Anomaly Five-year_Unc. Ten-year_Anomaly Ten-year_Unc. Twenty-year_Anomaly Twenty-year_Unc.
0 1850 1 -0.777 0.412 NaN NaN NaN NaN NaN NaN NaN NaN
1 1850 2 -0.239 0.458 NaN NaN NaN NaN NaN NaN NaN NaN
2 1850 3 -0.426 0.447 NaN NaN NaN NaN NaN NaN NaN NaN
h2o
Year Month Monthly_Anomaly Monthly_Unc. Annual_Anomaly Annual_Unc. Five-year_Anomaly Five-year_Unc. Ten-year_Anomaly Ten-year_Unc. Twenty-year_Anomaly Twenty-year_Unc.
0 1850 1 -0.724 0.370 NaN NaN NaN NaN NaN NaN NaN NaN
1 1850 2 -0.221 0.430 NaN NaN NaN NaN NaN NaN NaN NaN
2 1850 3 -0.443 0.419 NaN NaN NaN NaN NaN NaN NaN NaN

Can I separate the values of a dictionary into multiple columns and still be able to plot them?

I want to separate the values of a dictionary into multiple columns and still be able to plot them. At the moment all the values are in one column.
Concretely, I would like to split all the different values in the list of values, and use the number of values in the longest list as the number of columns. For all the shorter lists I would like to fill the gaps with something like 'NA' so I can still plot it in seaborn.
This is the dictionary that I used:
dictio = {'seq_7009': [6236.9764, 6367.049999999999], 'seq_418': [3716.3642000000004, 3796.4124000000006], 'seq_9143_unamb': [4631.958999999999], 'seq_2888': [5219.3359, 5365.4089], 'seq_1101': [4287.7417, 4422.8254], 'seq_107': [5825.695099999999, 5972.8073], 'seq_6946': [5179.3118, 5364.420900000001], 'seq_6162': [5531.503199999999, 5645.577399999999], 'seq_504': [4556.920899999999, 4631.959], 'seq_3535': [3396.1715999999997, 3446.1969999999997, 5655.896546], 'seq_4077': [4551.9108, 4754.0073,4565.987654,5668.9999976], 'seq_1626_unamb': [3724.3894999999998]}
This is the code for the dataframe:
df = pd.Series(dictio)
test = pd.DataFrame({'ID': df.index, 'Value': df.values})
seq_107 [5825.695099999999, 5972.8073]
seq_1101 [4287.7417, 4422.8254]
seq_1626_unamb [3724.3894999999998]
seq_2888 [5219.3359, 5365.4089]
seq_3535 [3396.1715999999997, 3446.1969999999997, 5655....
seq_4077 [4551.9108, 4754.0073, 4565.987654, 5668.9999976]
seq_418 [3716.3642000000004, 3796.4124000000006]
seq_504 [4556.920899999999, 4631.959]
seq_6162 [5531.503199999999, 5645.577399999999]
seq_6946 [5179.3118, 5364.420900000001]
seq_7009 [6236.9764, 6367.049999999999]
seq_9143_unamb [4631.958999999999]
Thanks in advance for the help!
Convert the Value column to a list of lists, and reload it into a new dataframe. Afterwards, call plot. Something like this -
df = pd.DataFrame(test.Value.tolist(), index=test.ID)
df
0 1 2 3
ID
seq_107 5825.6951 5972.8073 NaN NaN
seq_1101 4287.7417 4422.8254 NaN NaN
seq_1626_unamb 3724.3895 NaN NaN NaN
seq_2888 5219.3359 5365.4089 NaN NaN
seq_3535 3396.1716 3446.1970 5655.896546 NaN
seq_4077 4551.9108 4754.0073 4565.987654 5668.999998
seq_418 3716.3642 3796.4124 NaN NaN
seq_504 4556.9209 4631.9590 NaN NaN
seq_6162 5531.5032 5645.5774 NaN NaN
seq_6946 5179.3118 5364.4209 NaN NaN
seq_7009 6236.9764 6367.0500 NaN NaN
seq_9143_unamb 4631.9590 NaN NaN NaN
df.plot()
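Note that df.plot() draws one line per column (value position 0-3) across the sequence IDs on the x-axis. If one line per sequence is wanted instead, transposing first may be closer to the goal; a hedged sketch:
# Hedged sketch: each sequence becomes a column after transposing,
# giving one line per sequence instead of one per value position
df.T.plot()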
