How to read columns from different files and plot? - python

I have daily concentration data for every day from 2005 through 2018. I want to read one column from each of three different files and combine them into a single dataframe so I can plot them.
Data:file 1
time, mean_OMNO2d_003_ColumnAmountNO2CloudScreened
2005-01-01,-1.267651e+30
2005-01-02,4.90778397e+15
...
2018-12-31,-1.267651e+30
Data:file 2
time, OMNO2d_003_ColumnAmountNO2TropCloudScreened
2005-01-01,-1.267651e+30
2005-01-02,3.07444147e+15
...
Data:file 3
time, OMSO2e_003_ColumnAmountSO2_PBL
2005-01-01,-1.267651e+30
2005-01-02,-0.0144000314
...
I want to plot time and mean_OMNO2d_003_ColumnAmountNO2CloudScreened, OMNO2d_003_ColumnAmountNO2TropCloudScreened, OMSO2e_003_ColumnAmountSO2_PBL into one graph.
import glob
import pandas as pd

file_list = glob.glob('*.csv')
no = []
no2 = []
so2 = []
for f in file_list:
    df = pd.read_csv(f, skiprows=8, parse_dates=['time'], index_col='time')
    df.columns = ['no', 'no2', 'so2']
    no.append([df["no"]])
    no2.append([df["no2"]])
    so2.append([df["so2"]])
How do I solve the problem?

This is very doable. I had a similar problem with 3 files all in one plot. My understanding is that you want to compare levels of NO, NO2, and SO2, that each column is in comparable order, and that you want to compare across rows. If you are ok with importing matplotlib and numpy, something like this may work for you:
import numpy as np
import matplotlib.pyplot as plt

NO = np.asarray(df["no"])
NO2 = np.asarray(df["no2"])
SO2 = np.asarray(df["so2"])
timestamp = np.asarray(df["your_time_stamp"])
plt.plot(timestamp, NO)
plt.plot(timestamp, NO2)
plt.plot(timestamp, SO2)
plt.savefig("name_of_plot.png")
This will need some adjusting for your specific data frame, but I hope you see what I am getting at!
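Another option for combining the three single-column files is pd.concat, which aligns them on the shared time index. A minimal sketch, using inline StringIO stand-ins for the three files (replace them with your real file paths, and keep skiprows=8 if your files have the 8-line header block):

```python
import pandas as pd
from io import StringIO

# Inline stand-ins for the three CSV files from the question.
file1 = StringIO("time,mean_OMNO2d_003_ColumnAmountNO2CloudScreened\n"
                 "2005-01-01,-1.267651e+30\n2005-01-02,4.90778397e+15\n")
file2 = StringIO("time,OMNO2d_003_ColumnAmountNO2TropCloudScreened\n"
                 "2005-01-01,-1.267651e+30\n2005-01-02,3.07444147e+15\n")
file3 = StringIO("time,OMSO2e_003_ColumnAmountSO2_PBL\n"
                 "2005-01-01,-1.267651e+30\n2005-01-02,-0.0144000314\n")

series = {}
for name, f in [("no2", file1), ("no2_trop", file2), ("so2", file3)]:
    df = pd.read_csv(f, parse_dates=["time"], index_col="time")
    series[name] = df.iloc[:, 0]      # the single data column of each file

combined = pd.concat(series, axis=1)  # one column per species, aligned on time
print(combined.shape)                 # (2, 3)
```

From there, combined.plot() draws all three columns against the shared time axis in one figure.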

Related

OHLC of Multiple Scrips Using Pandas Resample function

I have ticks data of 2 scrips (scrip_names are abc and xyz). Since the ticks data is at a "second" level, I want to convert this to OHLC (Open, High, Low, Close) at 1 Minute level.
When the ticks data contains only 1 scrip, I use the following code (OHLC of Single Scrip.py) to get the OHLC at 1 Minute level. This code gives the desired result.
Code:
import pandas as pd

ticks = pd.read_csv(r'C:\Users\tech\Downloads\ticks.csv')
#ticks = ticks.where(ticks['scrip_name']=="abc")
#ticks = ticks.where(ticks['scrip_name']=="xyz")
ticks['timestamp'] = pd.to_datetime(ticks['timestamp'])
ticks = ticks.set_index(['timestamp'])
ohlc_prep = ticks.loc[:, ['last_price']]
ohlc_1_min = ohlc_prep['last_price'].resample('1min').ohlc().dropna()
ohlc_1_min.to_csv(r'C:\Users\tech\Downloads\ohlc_1_min.csv')
Result:
However, when the ticks data contains more than 1 scrip, this code doesn't work. What modifications should be done to the code to get the following result (filename: expected_result.csv), which is grouped by scrip_name?
Expected Result:
Here is the link to ticks data, python code for single scrip, result of single scrip, and desired result of multiple scrips: https://drive.google.com/file/d/1Y3jngm94hqAW_IJm-FAsl3SArVhnjGJE/view?usp=sharing
Any help is much appreciated. Thank you.
I think you need groupby like:
ticks['timestamp'] = pd.to_datetime(ticks['timestamp'])
ticks = ticks.set_index(['timestamp'])
ohlc_1_min = ticks.groupby('scrip_name')['last_price'].resample('1min').ohlc().dropna()
Or:
ohlc_1_min = (ticks.groupby(['scrip_name',
                             pd.Grouper(freq='1min', level='timestamp')])['last_price']
                   .ohlc()
                   .dropna())
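As a self-contained check of the groupby-then-resample approach, here is a sketch with synthetic two-scrip tick data; the column names match the question, but the timestamps and prices are made up:

```python
import pandas as pd

# Synthetic stand-in for ticks.csv: two scrips, three ticks each in one minute.
ticks = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2021-01-01 09:15:01", "2021-01-01 09:15:30", "2021-01-01 09:15:59"] * 2),
    "scrip_name": ["abc"] * 3 + ["xyz"] * 3,
    "last_price": [100.0, 101.5, 99.5, 200.0, 198.0, 202.0],
}).set_index("timestamp")

# One OHLC row per scrip per minute, indexed by (scrip_name, timestamp).
ohlc_1_min = (ticks.groupby("scrip_name")["last_price"]
                   .resample("1min").ohlc().dropna())
print(ohlc_1_min)
```

For scrip "abc" the single minute bar comes out as open=100.0, high=101.5, low=99.5, close=99.5, which is the per-scrip behavior the expected_result.csv asks for.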

Sum of columns from the same file

I need to calculate a corrected moment (Mz,correct), given by the sum of a moment (Mz) and a force (Fx) multiplied by its arm (300.56), because I need to change the reference system and move everything to the new reference system. This is the script I tried to write; Fx and Mz come from the same source file (.dat):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#----------------------- input file -------------------------------
filename = 'drag_time_series'  # source file name
# path to the DAT file
df = pd.read_csv(rf"C:\Users\suemack528\Desktop\OneDrive - Università degli Studi di Padova\deme\unipd\magistrale\TESI\impalcato\drag\(unknown).dat",
                 header=1, delim_whitespace=True)
df = df.round(decimals=3)

#----- "for" loop to obtain the corrected moment -----
for i in range(len('Mz')):
    Mz,correct[i] = df('Mz') + df('Fx')*300.56
I think this is not correct. How can I write this script better? I'm working in Spyder.
Error obtained:
To get a column from a pandas dataframe, use square brackets [] instead of parentheses (). For more information on indexing and selecting data from a dataframe, see the pandas documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#basics
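Beyond the bracket fix, the loop is not needed at all: pandas arithmetic works element-wise on whole columns. A minimal sketch with made-up Fx and Mz values (the column name Mz_correct is just an illustration, since a comma is not valid in a Python identifier):

```python
import pandas as pd

# Small stand-in for the .dat file contents (hypothetical values).
df = pd.DataFrame({"Fx": [10.0, 20.0], "Mz": [5.0, 6.0]})

# Columns are selected with [], and the whole-column expression
# computes the corrected moment for every row at once.
df["Mz_correct"] = df["Mz"] + df["Fx"] * 300.56
print(df["Mz_correct"])
```

Here the first row gives 5.0 + 10.0 * 300.56 = 3010.6, and the same formula is applied to every row without an explicit loop.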

Pandas read_csv: Columns are being imported as rows

Edit: I believe this was all user error. I have been typing df.T by default, and it just occurred to me that this is very likely the TRANSPOSE output. By typing df, the data frame is output normally (headers as columns). Thank you to those who stepped up to try and help. In the end, it was just my misunderstanding of pandas conventions.
Original Post
I'm not sure if I am making a simple mistake but the columns in a .csv file are being imported as rows using pd.read_csv. The dataframe turns out to be 5 rows by 2000 columns. I am importing only 5 columns out of 14 so I set up a list to hold the names of the columns I want. They match exactly those in the .csv file. What am I doing wrong here?
import os
import numpy as np
import pandas as pd

fp = 'C:/Users/my/file/path'
os.chdir(fp)

cols_to_use = ['VCOMPNO_CURRENT', 'MEASUREMENT_DATETIME',
               'EQUIPMENT_NUMBER', 'AXLE', 'POSITION']

df = pd.read_csv('measurement_file.csv',
                 usecols=cols_to_use,
                 dtype={'EQUIPMENT_NUMBER': int,
                        'AXLE': int},
                 parse_dates=[2],
                 infer_datetime_format=True)
Output:
0 ... 2603
VCOMPNO_CURRENT T92656 ... T5M247
MEASUREMENT_DATETIME 7/26/2018 13:04 ... 9/21/2019 3:21
EQUIPMENT_NUMBER 208 ... 537
AXLE 1 ... 6
POSITION L ... R
[5 rows x 2000 columns]
Thank you.
Edit: To note, if I import the entire .csv with the standard pd.read_csv('measurement_file.csv'), the columns are imported properly.
Edit 2: Sample csv:
VCOMPNO_CURRENT,MEASUREMENT_DATETIME,REPAIR_ORDER_NUMBER,EQUIPMENT_NUMBER,AXLE,POSITION,FLANGE_THICKNESS,FLANGE_HEIGHT,FLANGE_SLOPE,DIAMETER,RO_NUMBER_SRC,CL,VCOMPNO_AT_MEAS,VCOMPNO_SRC
T92656,10/19/2018 7:11,5653054,208,1,L,26.59,27.34,6.52,691.3,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T92656,10/19/2018 7:11,5653054,208,1,R,26.78,27.25,6.64,691.5,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T92656,10/19/2018 7:11,5653054,208,2,L,26.6,27.13,6.49,691.5,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T92656,10/19/2018 7:11,5653054,208,2,R,26.61,27.45,6.75,691.6,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T7L672,10/19/2018 7:11,5653054,208,3,L,26.58,27.14,6.58,644.4,OPTIMESS_DATA,2CTC ,T7L672 ,BOTH
T7L672,10/19/2018 7:11,5653054,208,3,R,26.21,27.44,6.17,644.5,OPTIMESS_DATA,2CTC ,T7L672 ,BOTH
A simple workaround here is just to take the transpose of the dataframe.
Link to Pandas Documentation
df = df.T
Can you try it like this?
import pandas as pd

dataset = pd.read_csv('yourfile.csv')
# filter to the needed columns here
dataset = dataset[cols_to_use]
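For a quick sanity check that usecols keeps columns as columns, here is a sketch that inlines two rows of the sample CSV from the question via StringIO instead of reading measurement_file.csv from disk:

```python
import pandas as pd
from io import StringIO

# Two rows of the sample CSV from the question, inlined for the check.
sample = StringIO(
    "VCOMPNO_CURRENT,MEASUREMENT_DATETIME,REPAIR_ORDER_NUMBER,EQUIPMENT_NUMBER,"
    "AXLE,POSITION,FLANGE_THICKNESS,FLANGE_HEIGHT,FLANGE_SLOPE,DIAMETER,"
    "RO_NUMBER_SRC,CL,VCOMPNO_AT_MEAS,VCOMPNO_SRC\n"
    "T92656,10/19/2018 7:11,5653054,208,1,L,26.59,27.34,6.52,691.3,"
    "OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP\n"
    "T92656,10/19/2018 7:11,5653054,208,1,R,26.78,27.25,6.64,691.5,"
    "OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP\n")

cols_to_use = ['VCOMPNO_CURRENT', 'MEASUREMENT_DATETIME',
               'EQUIPMENT_NUMBER', 'AXLE', 'POSITION']

df = pd.read_csv(sample, usecols=cols_to_use,
                 parse_dates=['MEASUREMENT_DATETIME'])

print(df.shape)    # (2, 5): columns imported as columns
print(df.T.shape)  # (5, 2): the transposed view that caused the confusion
```

This confirms the edit at the top of the post: read_csv is fine, and the rows-as-columns display only appears after df.T.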

Loop for multiple dataframes with a function

I tried to run a function over multiple data frames, but I ran into problems. My main questions are:
1) I tried to run a defined function with zip(df1, df2, df3, ...), expecting the outputs to be new dataframes DF1, DF2, DF3, ...; however, I failed. Is it possible to run a function over multiple dataframes via zip and get dataframes back?
2) If zip() is not an option, how do I run my function in a loop? Currently I have just three dataframes and they are easy to process separately, but I would like to know how to handle 50, 100, or even more dataframes.
Here are my codes:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#import scipy.stats as ss

# *********** 3 City Temperature files from NOAA ***********
# City 1
df1 = pd.read_csv('https://docs.google.com/spreadsheets/d/1Uj5N363dEVJZ9WVy2a_kkbJKJnyyE5qnEqOfzO0UCQE/gviz/tq?tqx=out:csv')
# City 2
df2 = pd.read_csv('https://docs.google.com/spreadsheets/d/13CgTdDCDzB_3WIYIRVMeLu6E36xzHSzRR5T_Ku0vThA/gviz/tq?tqx=out:csv')
# City 3
df3 = pd.read_csv('https://docs.google.com/spreadsheets/d/17pNZFIaV_NpQfSed-msIGu9jzzqF6JBvCZrBRiU2ZkQ/gviz/tq?tqx=out:csv')

def CleanDATA(data):
    data = data.drop(columns=['Annual'])
    data = data.drop(data.index[29:-1])
    data = data.drop(data.index[-1])
    monthname = []
    Temp = []
    for row in range(0, len(data)):
        for col in range(1, 13):
            #monthname.append(str(col)+"-"+str(data['Year'][row]))
            monthname.append(str(data['Year'][row]) + str(col))
            Temp.append(data.iloc[row, col])
    df0 = pd.DataFrame()
    df0['Month'] = monthname
    df0['Temperature'] = Temp
    df0['Month'] = pd.to_datetime(df0['Month'], format='%Y.0%m')  # change the date format
    df0['Month'] = pd.to_datetime(df0['Month']).dt.date  # remove the time, keep only the date
    data = df0[df0.applymap(np.isreal).all(1)]  # remove non-numerical rows
    return data

data1 = CleanDATA(df1)
data2 = CleanDATA(df2)
data3 = CleanDATA(df3)
Also, I found an issue with Pandas while reading the following excel file:
https://drive.google.com/file/d/1V9fKpACbLrSi0NfB0FHSgc96PQerKkUF/view?usp=sharing (This is city 1 temperature data from 1990-2019)
2019 is ongoing, so NOAA stations only provide information through this May. The Excel data labels all missing values with "M". I noticed that once a column contains an "M", I cannot use a boxplot directly, even after I drop the 2019 row. The Spyder console says items [Jun to Dec] are missing (and the weird thing is that I can use the same data for an XY line plot). To draw the boxplot, I have to manually remove the 2019 information (one row) in Excel and then read the new file.
I would do it using dictionaries (or lists, or another iterable).
cities = {'city1': 'https://...', 'city2': 'https://...', 'city3': 'https://...'}
df = {}
data = {}
for city, url in cities.items():
    df[city] = pd.read_csv(url)
    data[city] = CleanDATA(df[city])
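Here is a self-contained sketch of that dictionary pattern, which also handles the "M" missing-value markers mentioned in the question by passing na_values="M" to read_csv. The inline data and the clean function are stand-ins for the real downloads and for CleanDATA:

```python
import pandas as pd
from io import StringIO

# Tiny stand-ins for the per-city downloads; missing months are marked "M".
cities = {
    "city1": StringIO("Year,Jan,Feb\n1990,1.0,2.0\n2019,3.0,M\n"),
    "city2": StringIO("Year,Jan,Feb\n1990,4.0,5.0\n2019,6.0,M\n"),
}

def clean(df):
    # Hypothetical stand-in for CleanDATA: drop rows with any missing month.
    return df.dropna()

data = {}
for city, src in cities.items():
    # na_values="M" turns the "M" markers into NaN, so dropna() and plotting
    # work without hand-editing the 2019 row in the spreadsheet.
    df = pd.read_csv(src, na_values="M")
    data[city] = clean(df)

print({city: len(d) for city, d in data.items()})
```

The same loop scales to 50 or 100 dataframes: only the cities dictionary grows, not the code.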

Select several years pandas dataframe

I am trying to select several years from a dataframe in monthly resolution.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import netCDF4 as nc

#-- open the netCDF file and read in the variables
data = nc.Dataset('test.nc')
time = nc.num2date(data.variables['Time'][:],
                   data.variables['Time'].units)
df = pd.DataFrame(data.variables['mgpp'][:, 0, 0], columns=['mgpp'])
df['dates'] = time
df = df.set_index('dates')
print(df.head())
This is what the head looks like:
mgpp
dates
1901-01-01 0.040735
1901-02-01 0.041172
1901-03-01 0.053889
1901-04-01 0.066906
Now I managed to extract one year:
df_cp = df[df.index.year == 2001]
but how would I extract several years, say 1997, 2001, and 2007, and have them stored in the same dataframe? Is there a one- or two-line solution? My only idea for now is to iterate and then merge the dataframes, but maybe there is a better solution!
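One possible one-liner is to test the index's year against a list with .isin(), since df.index.year is an integer array. A sketch with a stand-in monthly dataframe (the real df built from the netCDF file would work the same way, because its index is also datetime-like):

```python
import pandas as pd

# Stand-in for the monthly dataframe: month-start dates from 1997 to 2007.
idx = pd.date_range("1997-01-01", "2007-12-01", freq="MS")
df = pd.DataFrame({"mgpp": range(len(idx))}, index=idx)

# .isin() selects several years in one boolean mask, no merging needed.
subset = df[df.index.year.isin([1997, 2001, 2007])]
print(len(subset))  # 36: twelve months for each of the three years
```

This is the multi-year analogue of the single-year mask df[df.index.year == 2001] already shown above.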
