I have some csv data in the following format.
Ln Dr Tag Lab 0:01 0:02 0:03 0:04 0:05 0:06 0:07 0:08 0:09
L0 St vT 4R 0 0 0 0 0 0 0 0 0
L2 Tx st 4R 8 8 8 8 8 8 8 8 8
L2 Tx ss 4R 1 1 9 6 1 0 0 6 7
I want to plot a timeseries graph using the columns (Ln , Dr, Tg,Lab) as the keys and the 0:0n field as values on a timeseries graph.
I have the following code.
#!/usr/bin/env python
import matplotlib.pyplot as plt
import datetime
import numpy as np
import csv
import sys
with open("test.csv", 'r', newline='') as fin:
reader = csv.DictReader(fin)
for row in reader:
key = (row['Ln'], row['Dr'], row['Tg'],row['Lab'])
#code to extract the values and plot a timeseries.
How do I extract all the values in columns 0:0n without induviduall specifying each one of them. I want all the timeseries to be plotted on a single timeseries?
I'd suggest using pandas:
import pandas as pd
a=pd.read_csv('yourfile.txt',delim_whitespace=True)
for x in a.iterrows():
x[1][4:].plot(label=str(x[1][0])+str(x[1][1])+str(x[1][2])+str(x[1][3]))
plt.ylim(-1,10)
plt.legend()
I'm not really sure exactly what you want to do but np.loadtxt is the way to go here. make sure to set the delimiter correctly for your file
data = np.loadtxt(fname="test.csv",delimiter=',',skiprows=1)
now the n-th column of data is the n-th column of the file and same for rows.
you can access data by line: data[n] or by column: data[:,n]
Related
I using pandas to webscrape this site https://www.mapsofworld.com/lat_long/poland-lat-long.html but i only gettin 3 elements. How could I get all elements from table?
import numpy as np
import pandas as pd
#for getting world map
import folium
# Retreiving Latitude and Longitude coordinates
info = pd.read_html("https://www.mapsofworld.com/lat_long/poland-lat-long.html",match='Augustow',skiprows=2)
#convering the table data into DataFrame
coordinates = pd.DataFrame(info[0])
data = coordinates.head()
print(data)
It looks like if you install and use html5lib as your parser it may fix your issues:
df = pd.read_html("https://www.mapsofworld.com/lat_long/poland-lat-long.html",attrs={"class":"tableizer-table"},skiprows=2,flavor="html5lib")
>>>df
[ 0 1 2
0 Locations Latitude Longitude
1 NaN NaN NaN
2 Augustow 53°51'N 23°00'E
3 Auschwitz/Oswiecim 50°02'N 19°11'E
4 Biala Podxlaska 52°04'N 23°06'E
.. ... ... ...
177 Zawiercie 50°30'N 19°24'E
178 Zdunska Wola 51°37'N 18°59'E
179 Zgorzelec 51°10'N 15°0'E
180 Zyrardow 52°3'N 20°28'E
181 Zywiec 49°42'N 19°10'E
[182 rows x 3 columns]]
I have written a code to perform some data cleaning to get the final columns and values from a tab spaced file.
import matplotlib.image as image
import numpy as np
import tkinter as tk
import matplotlib.ticker as ticker
from tkinter import filedialog
import matplotlib.pyplot as plt
root = tk.Tk()
root.withdraw()
root.call('wm', 'attributes', '.', '-topmost', True)
files1 = filedialog.askopenfilename(multiple=True)
files = root.tk.splitlist(files1)
List = list(files)
%gui tk
for i,file in enumerate(List,1):
d = pd.read_csv(file,sep=None,engine='python')
h = d.drop(d.index[19:])
transpose = h.T
header =transpose.iloc[0]
df = transpose[1:]
df.columns =header
df.columns = df.columns.str.strip()
all_columns = list(df)
df[all_columns] = df[all_columns].astype(str)
k =df.drop(columns =['Op:','Comment:','Mod Type:', 'PN', 'Irradiance:','Irr Correct:', 'Lamp Voltage:','Corrected To:', 'MCCC:', 'Rseries:', 'Rshunt:'], axis=1)
k.head()
I want to run this code to multiple files and do the same and concatenate all the results to one data frame.
for eg, If I select 20 files, then new data frame with one line of header and all the 20 results below with increasing order of the value from the column['Module Temp:'].
It would be great if someone could provide a solution to this problem
Please find the link to sample data:https://drive.google.com/drive/folders/1sL2-CwCGeGm0-fvcpzMVzgFnYzN3wzVb?usp=sharing
The following code shows how to parse the files and extract the data. It doesn't show the tkinter GUI component. files will represent your selected files.
Assumptions:
The first 92 rows of the files are always the measurement parameters
Rows from 93 are the measurements.
The 'Module Temp' for each file is different
The lists will be sorted based on the sort order of mod_temp, so the data will be in order in the DataFrame.
The list sorting uses the accepted answer to Sorting list based on values from another list?
import pandas as p
from patlib import Path
# set path to files
path_ = Path('e:/PythonProjects/stack_overflow/data/so_data/2020-11-16')
# select the correct files
files = path_.glob('*.ivc')
# create lists for metrics
measurement_params = list()
mod_temp = list()
measurements = list()
# iterate through the files
for f in files:
# get the first 92 rows with the measurement parameters
mp = pd.read_csv(f, sep='\t', nrows=91, index_col=0)
# remove the whitespace and : from the end of the index names
mp.index = mp.index.str.replace(':', '').str.strip().str.replace('\\s+', '_')
# get the column header
col = mp.columns[0]
# get the module temp
mt = mp.loc['Module_Temp', col]
# add Modult_Temp to mod_temp
mod_temp.append(float(mt))
# get the measurements
m = pd.read_csv(f, sep='\t', skiprows=92, nrows=3512)
# remove the whitespace and : from the end of the column names
m.columns = m.columns.str.replace(':', '').str.strip()
# add Module_Temp column
m['mod_temp'] = mt
# store the measure parameters
measurement_params.append(mp.T)
# store the measurements
measurements.append(m)
# sort lists based on mod_temp sort order
measurement_params = [x for _, x in sorted(zip(mod_temp, measurement_params))]
measurements = [x for _, x in sorted(zip(mod_temp, measurements))]
# create a dataframe for the measurement parameters
df_mp = pd.concat(measurement_params)
# create a dataframe for the measurements
df_m = pd.concat(measurements).reset_index(drop=True)
df_mp
Title: Comment Op ID Mod_Type PN Date Time Irradiance IrrCorr Irr_Correct Lamp_Voltage Module_Temp Corrected_To MCCC Voc Isc Rseries Rshunt Pmax Vpm Ipm Fill_Factor Active_Eff Aperture_Eff Segment_Area Segs_in_Ser Segs_in_Par Panel_Area Vload Ivld Pvld Frequency SweepDelay SweepLength SweepSlope SweepDir MCCC2 MCCC3 MCCC4 LampI IntV IntV2 IntV3 IntV4 LoadV PulseWidth1 PulseWidth2 PulseWidth3 PulseWidth4 TRef1 TRef2 TRef3 TRef4 MCMode Irradiance2 IrrCorr2 Voc2 Isc2 Pmax2 Vpm2 Ipm2 Fill_Factor2 Active_Eff2 ApertureEff2 LoadV2 PulseWidth12 PulseWidth22 Irradiance3 IrrCorr3 Voc3 Isc3 Pmax3 Vpm3 Ipm3 Fill_Factor3 Active_Eff3 ApertureEff3 LoadV3 PulseWidth13 PulseWidth23 RefCellID RefCellTemp RefCellIrrMM RefCelIscRaw RefCellIsc VTempCoeff ITempCoeff PTempCoeff MismatchCorr Serial_No Soft_Ver
Nease 345W N345M72 STC Admin MCIND2021-058 ModuleType1 NaN 10-09-2020 19:12:52 100.007 100 Ref Cell 2400 25.2787 25 1.3669 46.4379 9.13215 0.43411 294.467 331.924 38.3403 8.65732 0.78269 1.89434 1.7106 243.36 72 1 19404 0 0 0 218000 10 100 0.025 0 1 1.155 1.155 20.4736 6.87023 6.8645 6 6 6.76 107.683 109.977 0 0 27.2224 0 0 0 False -1.#INF 70 0 0 0 0 0 0 0 0 5 107.683 109.977 -1.#INF 40 0 0 0 0 0 0 0 0 5 107.683 109.977 WPVS mono C-Si Ref Cell 25.9834 1001.86 0.15142 0.15135 -0.31 0.05 -0.4 0.9985 S91-00052 5.5.1
Solarium SGE24P330 STC Admin MCIND_2021_0074 ModuleType1 NaN 17-09-2020 15:06:12 99.3671 100 Ref Cell 2400 25.3380 25 1.3669 45.2903 8.87987 0.48667 216.763 311.031 36.9665 8.41388 0.77338 1.77510 1.60292 243.36 72 1 19404 0 0 0 218000 10 100 0.025 0 1 1.155 1.155 20.405 6.82362 6.8212 6 6 6.6 107.660 109.977 0 0 25.9418 0 0 0 False -1.#INF 70 0 0 0 0 0 0 0 0 4.943 107.660 109.977 -1.#INF 40 0 0 0 0 0 0 0 0 4.943 107.660 109.977 WPVS mono C-Si Ref Cell 25.3315 998.370 0.15085 0.15082 -0.31 0.05 -0.4 0.9985 S91-00052 5.5.1
Nease 345W N345M72 STC Admin MCIND2021-058 ModuleType1 NaN 10-09-2020 19:11:32 100.010 100 Ref Cell 2400 25.3557 25 1.3669 46.4381 9.11368 0.41608 299.758 331.418 38.3876 8.63345 0.78308 1.89144 1.70798 243.36 72 1 19404 0 0 0 218000 10 100 0.025 0 1 1.155 1.155 20.3820 6.87018 6.8645 6 6 6.76 107.683 109.977 0 0 27.2535 0 0 0 False -1.#INF 70 0 0 0 0 0 0 0 0 5 107.683 109.977 -1.#INF 40 0 0 0 0 0 0 0 0 5 107.683 109.977 WPVS mono C-Si Ref Cell 25.9614 1003.80 0.15171 0.15164 -0.31 0.05 -0.4 0.9985 S91-00052 5.5.1
Nease 345W N345M72 STC Admin MCIND2021-058 ModuleType1 NaN 10-09-2020 19:14:09 99.9925 100 Ref Cell 2400 25.4279 25 1.3669 46.4445 9.14115 0.43428 291.524 332.156 38.2767 8.67776 0.78236 1.89566 1.71179 243.36 72 1 19404 0 0 0 218000 10 100 0.025 0 1 1.155 1.155 20.5044 6.87042 6.8645 6 6 6.76 107.660 109.977 0 0 27.1989 0 0 0 False -1.#INF 70 0 0 0 0 0 0 0 0 5 107.660 109.977 -1.#INF 40 0 0 0 0 0 0 0 0 5 107.660 109.977 WPVS mono C-Si Ref Cell 26.0274 1000.93 0.15128 0.15121 -0.31 0.05 -0.4 0.9985 S91-00052 5.5.1
df_m.head()
Voltage Current mod_temp
0 -1.193405 9.202885 25.2787
1 -1.196560 9.202489 25.2787
2 -1.193403 9.201693 25.2787
3 -1.196558 9.201298 25.2787
4 -1.199711 9.200106 25.2787
df_m.tail()
Voltage Current mod_temp
14043 46.30869 0.315269 25.4279
14044 46.31411 0.302567 25.4279
14045 46.31949 0.289468 25.4279
14046 46.32181 0.277163 25.4279
14047 46.33039 0.265255 25.4279
Plot
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 8))
sns.scatterplot(x='Current', y='Voltage', data=df_m, hue='mod_temp', s=10)
plt.show()
Note
After doing this, I was having trouble plotting the data because the columns were not float type. However, an error occurred when trying to set the type. Looking back at the data, after row 92, there are multiple headers throughout the two columns.
Row 93: Voltage: Current:
Row 3631: Ref Cell: Lamp I:
Row 7169: Voltage2: Current2:
Row 11971: Ref Cell2: Lamp I2:
Row 16773: Voltage3: Current3:
Row 21575: Ref Cell3: Lamp I3:
Row 26377: Raw Voltage: Raw Current :
Row 29915: WPVS Voltage: WPVS Current:
I went back and used the nrows parameter when creating m, so only the first set of headers and associated measurements are extracted from the file.
I recommend writing a script using the csv module to read each file, and create a new file beginning at each blank row, this will make the files have consistent types of measurements.
This should be a new question, if needed.
There are various ways to do it. You can append one dataframe to another (basically stack one on top of the other), and you can do it in the loop. Here is an example. I use fake dfs but you will use your own
import pandas as pd
import numpy as np
combined = None
for _ in range(5):
# stub df creation -- you will use your real code here
df = pd.DataFrame(columns = ['Module Temp','A', 'B'], data = np.random.random((5,3)))
if combined is None:
# initialize with the first one
combined = df.copy()
else:
# add the next one
combined = combined.append(df, sort = False, ignore_index = True)
combined.sort_values('Module Temp', inplace = True)
Here combined will have all the dfs, sorted by 'Module Temp'
I have a .dat file, about whose origin I am not sure. I have to read this file in order to perform PCA. Assuming it to be white spaced file, I was successfully able to read the contents of file and ignore the first column (as it is a index), but the very first row. Below is the code:
import numpy as np
import pandas as pd
from numpy import array
myarray = pd.read_csv('hand_postures.dat', delim_whitespace=True)
myarray = array(myarray)
print(myarray.shape)
myarray = np.delete(myarray,0,1)
print(myarray)
print(myarray.shape)
The file is shared at the link https://drive.google.com/open?id=0ByLV3kGjFP_zekN1U1c3OGFrUnM. Can someone help me point out my mistake?
You need an extra parameter when calling pd.read_csv.
df = pd.read_csv('hand_postures.dat', header=None, delim_whitespace=True, index_col=[0])
df.head()
1 2 3 4 5 6 7 8 \
0
0 -65.55560 0.172413 44.4944 22.2472 0.000000 50.6723 34.3434 17.1717
1 -65.55560 2.586210 43.8202 21.9101 0.277778 51.4286 34.3434 17.1717
2 -45.55560 5.000000 43.8202 21.9101 0.833333 56.7227 42.4242 21.2121
3 5.55556 -2.241380 46.5169 23.2584 1.111110 70.3361 85.8586 42.9293
4 67.77780 20.689700 59.3258 29.6629 2.222220 80.9244 93.9394 46.9697
9 10 11 12 13 14 15 16 \
0
0 -0.235294 54.6154 39.7849 19.8925 0.705883 37.2656 41.3043 20.6522
1 -0.235294 55.3846 38.7097 19.3548 0.705883 38.6719 41.3043 20.6522
2 0.000000 63.0769 47.3118 23.6559 0.000000 47.8125 54.3478 27.1739
3 -0.117647 83.8462 90.3226 45.1613 0.352941 73.1250 92.3913 46.1957
4 0.117647 93.8462 98.9247 49.4624 -0.352941 89.2969 100.0000 50.0000
17 18 19 20
0
0 15.0 34.6584 54.1270 27.0635
1 14.4 35.2174 55.8730 27.9365
2 14.4 43.6025 69.8413 34.9206
3 3.6 73.7888 94.2857 47.1429
4 -1.2 92.2360 106.5080 53.2540
header=None specifies that the first row is part of the data (and not the header)
index_col=[0] specifies that the first column is to be treated as the index
I have a excel file that I want to group based on the Column name 'Step No.' and want the corresponding value.Here is a piece of code I wrote :
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
fpath=('/Users/Anil/Desktop/Test data.xlsx')
df=pd.read_excel(fpath)
data=df.loc[:,['Step No.','Parameter','Values']]
grp_data=pd.DataFrame(data.groupby(['Step No.','Values']).size().reset_index())
grp_data.to_excel('/Users/Anil/Desktop/Test1 data.xlsx')
The data gets grouped just as I want it to.
Step No. Values
1 62
1 62.5
1 63
1 66.5
1 68
1 70
1 72
1 76.5
1 77
2 66.5
2 67
2 69
3 75.5
3 77
But, I want data corresponding to each Step No. in a different excel sheet, i.e all values corresponding to Step No.1 in one sheet, Step No. 2 in another sheet and so on. I think I should use some sort of iteration, but don't know what kind exactly.
This should do it:
from pandas import ExcelWriter
steps = df['Step No.'].unique()
dfs = [df.loc[df['Step No.']==step] for step in steps]
def save_xls(list_dfs, xls_path):
writer = ExcelWriter(xls_path)
for n, df in enumerate(list_dfs):
df.to_excel(writer,'sheet%s' % n)
writer.save()
save_xls(dfs, 'YourFile.xlsx')
I have an excel file containing multiple worksheets. Each worksheet contains price & inventory data for individual item codes for a particular month.
for example...
sheetname = 201509
code price inventory
5001 5 92
5002 7 50
5003 6 65
sheetname = 201508
code price inventory
5001 8 60
5002 10 51
5003 6 61
Using pandas dataframe, how is the best way to import this data, organized by time and item code.
I need this dataframe to eventually be able to graph changes in price&inventory for item code 5001 for example.
I would appreciate your help. I am still new to python/pandas.
Thanks.
My solution...
Here is a solution I found to my problem.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
D201509 = pd.read_excel('ExampleSpreadsheet.xlsx', sheetname='201509', index_col='Code')
D201508 = pd.read_excel('ExampleSpreadsheet.xlsx', sheetname='201508', index_col='Code')
D201507 = pd.read_excel('ExampleSpreadsheet.xlsx', sheetname='201507', index_col='Code')
D201506 = pd.read_excel('ExampleSpreadsheet.xlsx', sheetname='201506', index_col='Code')
D201505 = pd.read_excel('ExampleSpreadsheet.xlsx', sheetname='201505', index_col='Code')
total = pd.concat(dict(D201509=D201509, D201508=D201508, D201507=D201507, D201506=D201506, D201505=D201505), axis=1)
total.head()
which will nicely produce this dataframe with hierarchical columns..
Now my new question is how would you plot the change in prices for every code # with this dataframe?
I want to see 5 lines (5001,5002,5003,5004,5005), with the x axis being the time (D201505, D201506, etc) and the y axis being the price value.
Thanks.
This will get your data into a data frame and do a scatter plot on 5001
import pandas as pd
import matplotlib.pyplot as plt
import xlrd
file = r'C:\dickster\data.xlsx'
list_dfs = []
xls = xlrd.open_workbook(file, on_demand=True)
for sheet_name in xls.sheet_names():
df = pd.read_excel(file,sheet_name)
df['time'] = sheet_name
list_dfs.append(df)
dfs = pd.concat(list_dfs,axis=0)
dfs = dfs.sort(['time','code'])
which looks like:
code price inventory time
0 5001 8 60 201508
1 5002 10 51 201508
2 5003 6 61 201508
0 5001 5 92 201509
1 5002 7 50 201509
2 5003 6 65 201509
And now the plot of 5001: price v inventory:
dfs[dfs['code']==5001].plot(x='price',y='inventory',kind='scatter')
plt.show()
which produces: