I have an Excel spreadsheet and would like to use Python to convert its measurements from cm3/day to cm3/year.
Is there a way to do this?
I've mostly looked into openpyxl, since that module comes up most often for Excel editing, but I'm confused about how to edit the units so that they are all the same. I can't seem to find a module that supports what I'm trying to do.
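You can do this easily with pandas. To read Excel files, pandas needs a reader engine installed: openpyxl for .xlsx files, or xlrd for legacy .xls files:
pip3 install pandas openpyxl
Alternatively, just save your file as CSV.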
import pandas as pd
# Read the file with read_csv() or read_excel()
df = pd.read_excel('your_file.xlsx', index_col=0) # Your index is the first column
>>> df
measure amount
precip
1 cm3/day 45
2 cm3/day 132
3 cm3/year 9565
4 cm3/sec 5
5 cm3/day 67
6 cm3/day 52
7 cm3/sec 2
8 cm3/day 78
9 cm3/sec 3
10 cm3/day 92
Then you can use apply() to check and update the values. With the option axis=1, apply() calls a function once per row of a pd.DataFrame, and the function receives that row as a pd.Series object.
Let's define a function:
def _update(serie):
    val = serie['amount']  # The original value
    volume, time = serie['measure'].split('/')  # Volume unit and time unit
    # Check and update
    if time == 'year':
        return serie
    elif time == 'day':
        serie['amount'] = val * 365
    elif time == 'hour':
        serie['amount'] = val * 24 * 365
    elif time == 'sec':
        serie['amount'] = val * 3600 * 24 * 365
    else:
        # Refuse to relabel a value whose time unit is not handled above
        raise ValueError('Unknown time unit: ' + time)
    # Update measure col
    serie['measure'] = 'cm3/year'
    return serie
Then apply the function:
new_df = df.apply(_update, axis=1)
>>> new_df
measure amount
precip
1 cm3/year 16425
2 cm3/year 48180
3 cm3/year 9565
4 cm3/year 157680000
5 cm3/year 24455
6 cm3/year 18980
7 cm3/year 63072000
8 cm3/year 28470
9 cm3/year 94608000
10 cm3/year 33580
# Save the new file:
new_df.to_excel('new_file.xlsx')
Hope this helps!
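As a possible alternative to apply(), the same conversion can be sketched with vectorized column operations, which tends to scale better on large sheets. The sample frame and the PER_YEAR lookup table below are assumptions made for illustration, and the factors assume a 365-day year, as in _update above.

```python
import pandas as pd

# Hypothetical sample mirroring the 'measure'/'amount' layout shown above.
df = pd.DataFrame({
    "measure": ["cm3/day", "cm3/year", "cm3/sec", "cm3/hour"],
    "amount": [45, 9565, 5, 2],
})

# Multipliers from each time unit to one 365-day year (same factors as _update).
PER_YEAR = {"year": 1, "day": 365, "hour": 24 * 365, "sec": 3600 * 24 * 365}

# The text after the slash is the time unit, e.g. 'cm3/day' -> 'day'.
time_unit = df["measure"].str.split("/").str[1]
factor = time_unit.map(PER_YEAR)

# Fail loudly on a unit missing from the table rather than writing NaN.
if factor.isna().any():
    raise ValueError("unknown time unit in 'measure' column")

converted = df.assign(amount=df["amount"] * factor, measure="cm3/year")
print(converted)
```

The result can be written out with converted.to_excel(...) exactly as in the apply() version.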
If the file is in "*.xlsx" format you can read the file in python like this:
#first import necessary packages
import pandas as pd
import numpy as np
data = pd.read_excel(file_name)
If in "*.csv" format do this:
#first import necessary packages
import pandas as pd
import numpy as np
data = pd.read_csv(file_name)
To perform a calculation on a column (I don't follow the cm3/day/sec format, but if you have cm3/day you can convert it to cm3/year with the code below):
# First check the type of your column
data["column_name"].dtype
# If the column's dtype is string (object), convert it to a numeric type:
# either to integer
data["column_name"] = data["column_name"].astype(int)
# or to float
data["column_name"] = data["column_name"].astype(float)
# If the column is already numeric, leave it as it is
# To convert cm3/day to cm3/year
data["column_name"] = data["column_name"] * 365
PS: I can't see the linked image, so I couldn't use the actual column names from the Excel sheet.
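If the column mixes numbers with stray text, astype(int) or astype(float) will raise an error. A minimal sketch of a more forgiving conversion uses pd.to_numeric; the column name and sample values here are hypothetical.

```python
import pandas as pd

# Hypothetical column read from a spreadsheet as strings.
data = pd.DataFrame({"column_name": ["45", "132", "n/a", "67.5"]})

# errors="coerce" turns unparseable strings into NaN instead of raising.
data["column_name"] = pd.to_numeric(data["column_name"], errors="coerce")

# Convert cm3/day to cm3/year; NaN rows stay NaN and can be inspected later.
data["per_year"] = data["column_name"] * 365
print(data)
```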
I have a .dat file which looks something like the below....
#| step | Channel| Mode | Duration|Freq.| Amplitude | Phase|
0 1 AWG Pi/2 100 2 1
1 1 SIN^2 100 1 1
2 1 SIN^2 200 0.5 1
3 1 REC 50 100 1 1
100 0 REC Pi/2 150 1 1
I created a DataFrame and wanted to extract data from it, but I get an error:
TypeError: expected str, bytes or os.PathLike object, not DataFrame
My code is below:
import pandas as pd
import numpy as np
path = "updated.dat"
datContent = [i.strip().split() for i in open(path).readlines()]
#print(datContent)
column_names = datContent.pop(0)
print(column_names)
df = pd.DataFrame(datContent)
print(df)
extract_column = df.iloc[:,2]
with open (df, 'r') as openfile :
    for line in openfile:
        for column_search in line:
            column_search = df.iloc[:,2]
            if "REC" in column_search:
                print ("Rec found")
Any suggestions would be appreciated.
Since your post does not ask a clear question, I have to guess based on your code. I assume you want to find all rows of the DataFrame where the Mode column contains the value REC.
Based on that, I prepared a small, self-contained example that works on your data.
In your situation, the only line you need is the last one. Assuming your DataFrame is created and filled correctly, everything in your code below print(df) can be replaced by this single line.
I would really recommend reading the official documentation on indexing and selecting data from DataFrames: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
import pandas as pd
from io import StringIO
data = StringIO("""
no;step;Channel;Mode;Duration;Freq.;Amplitude;Phase
;0;1;AWG;Pi/2;100;2;1
;1;1;SIN^2;;100;1;1
;2;1;SIN^2;;200;0.5;1
;3;1;REC;50;100;1;1
;100;0;REC;Pi/2;150;1;1
""")
df = pd.read_csv(data, sep=";")
df.loc[df.loc[:, 'Mode'] == "REC", :]
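The TypeError in the question comes from passing the DataFrame to open(), which expects a file path. As a rough sketch of the same filter applied to a frame built the way the question builds it (split lines, no column names), the mode can be compared by position. The rows below are a hand-copied subset of the sample and only illustrative.

```python
import pandas as pd

# A few rows shaped like the .dat sample. The split lists have unequal
# lengths, so pandas pads the shorter rows with missing values.
rows = [
    "0 1 AWG Pi/2 100 2 1".split(),
    "1 1 SIN^2 100 1 1".split(),
    "3 1 REC 50 100 1 1".split(),
    "100 0 REC Pi/2 150 1 1".split(),
]
df = pd.DataFrame(rows)

# Column 2 holds the mode; compare it directly instead of opening df as a file.
rec_rows = df[df.iloc[:, 2] == "REC"]
print(rec_rows)
```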
I'm trying to create a for-loop that automatically runs through my parsed list of NASDAQ stocks and inserts their Quandl codes, which are then retrieved from Quandl's database, essentially creating a large data set of stocks to perform data analysis on. My code "appears" right, but when I print the query it only prints 'GOOG/NASDAQ_Ticker' and nothing else. Any help or suggestions would be much appreciated.
import quandl
import pandas as pd
import matplotlib.pyplot as plt
import numpy
def nasdaq():
    nasdaq_list = pd.read_csv('C:\Users\NAME\Documents\DATASETS\NASDAQ.csv')
    nasdaq_list = nasdaq_list[[0]]
    print nasdaq_list
    for abbv in nasdaq_list:
        query = 'GOOG/NASDAQ_' + str(abbv)
        print query
        df = quandl.get(query, authtoken="authoken")
        print df.tail()[['Close', 'Volume']]
Iterating over a pd.DataFrame as you have done iterates over the column labels, not over the rows. For example,
>>> df = pd.DataFrame(np.arange(9).reshape((3,3)))
>>> df
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
>>> for i in df[[0]]: print(i)
0
I would just get the first column as a Series with .iloc (the older .ix indexer has been removed from pandas),
>>> for i in df.iloc[:,0]: print(i)
0
3
6
Note that in general if you want to iterate by row over a DataFrame you're looking for iterrows().
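Applied to the question, a minimal sketch of building one query string per ticker could look like this. The frame below is a hypothetical stand-in for NASDAQ.csv, whose real columns are not shown, and the quandl.get call is left out so the snippet runs without an API token.

```python
import pandas as pd

# Hypothetical stand-in for the parsed NASDAQ.csv file.
nasdaq_list = pd.DataFrame({
    "Ticker": ["AAPL", "MSFT", "GOOG"],
    "Name": ["Apple", "Microsoft", "Alphabet"],
})

# .iloc[:, 0] returns the first column as a Series, so the loop sees the
# ticker values rather than the single column label 'Ticker'.
queries = ["GOOG/NASDAQ_" + str(abbv) for abbv in nasdaq_list.iloc[:, 0]]
print(queries)
```

Each entry of queries can then be passed to the data-retrieval call inside the loop.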
I have the following data set in a csv file:
vehicle---time-----aspd[m/s]------gspd[m/s]----hdg---alt[m-msl]
veh_1---17:19.5---0.163471505---0.140000001---213---273.8900146
veh_2---17:19.5---0.505786836---0.170000002---214---273.9100037
veh_3---17:19.8---0.173484877---0.109999999---213---273.980011
veh_4---44:12.4---18.64673424---19.22999954---316---388.9299927
veh_5---44:13.0---18.13533401---19.10000038---316---389.1700134
I am trying to write a function launch_time() with two inputs (dataframe, vehicle name) that returns the first time the gspd is reported above 10.0 m/s.
The output time must be converted from a string (HH:MM:SS.SS) to a minutes after 12:00 format.
It should look something like this:
>>> launch_time(df, veh_1)
30.0
I will use this function to iterate through each vehicle and then need to record the results into a list of tuples with the format (v_name, launch time) in launch sequence order.
It should look something like this:
'veh_1', 30.0, 'veh_2', 15.0
Disclosure: my python/pandas knowledge is very entry-level.
You can use read_csv with the regular-expression separator -{3,}, which splits on runs of three or more hyphens:
import pandas as pd
from io import StringIO
temp=u"""vehicle---time-----aspd[m/s]------gspd[m/s]----hdg---alt[m-msl]
veh_1---17:19.5---0.163471505---0.140000001---213---273.8900146
veh_2---17:19.5---0.505786836---0.170000002---214---273.9100037
veh_3---17:19.8---0.173484877---0.109999999---213---273.980011
veh_4---44:12.4---18.64673424---19.22999954---316---388.9299927
veh_5---45:13.0---18.13533401---19.10000038---316---389.1700134"""
# After testing, replace StringIO(temp) with the filename
df = pd.read_csv(StringIO(temp), sep="-{3,}", engine='python')
print (df)
vehicle time aspd[m/s] gspd[m/s] hdg alt[m-msl]
0 veh_1 17:19.5 0.163472 0.14 213 273.890015
1 veh_2 17:19.5 0.505787 0.17 214 273.910004
2 veh_3 17:19.8 0.173485 0.11 213 273.980011
3 veh_4 44:12.4 18.646734 19.23 316 388.929993
4 veh_5 45:13.0 18.135334 19.10 316 389.170013
Then convert the time column with to_timedelta, filter all rows above 10 m/s by boolean indexing, sort with sort_values, group by vehicle with groupby, and take the first row of each group. Finally, zip the vehicle and time columns and convert the result to a list:
# Prefix '00:' so that 'MM:SS.S' parses as HH:MM:SS.S, then keep whole minutes
df['time'] = (pd.to_timedelta('00:' + df['time']).dt.total_seconds() // 60).astype(int)
req = (df[df['gspd[m/s]'] > 10]
       .sort_values('time', ascending=True)
       .groupby('vehicle', as_index=False)
       .head(1))
print(req)
vehicle time aspd[m/s] gspd[m/s] hdg alt[m-msl]
3 veh_4 44 18.646734 19.23 316 388.929993
4 veh_5 45 18.135334 19.10 316 389.170013
L = list(zip(req['vehicle'],req['time']))
print (L)
[('veh_4', 44), ('veh_5', 45)]
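The question also asks for a launch_time(df, vehicle) function. One way it might be sketched is shown here, assuming the time strings are MM:SS.S as in the sample (the question describes HH:MM:SS.SS, so the parsing line may need adjusting). The threshold parameter and the sample frame are additions for illustration only.

```python
import pandas as pd

def launch_time(df, vehicle, threshold=10.0):
    """Return the time in minutes of the earliest row where the vehicle's
    ground speed exceeds threshold, or None if it never does."""
    rows = df[(df["vehicle"] == vehicle) & (df["gspd[m/s]"] > threshold)]
    if rows.empty:
        return None
    # Assumption: 'time' looks like MM:SS.S, so prefix '00:' for the hours.
    first = pd.to_timedelta("00:" + rows["time"]).min()
    return first.total_seconds() / 60

# Hypothetical sample with the original string 'time' column.
df = pd.DataFrame({
    "vehicle": ["veh_1", "veh_4", "veh_4", "veh_5"],
    "time": ["17:19.5", "44:12.4", "44:30.0", "45:13.0"],
    "gspd[m/s]": [0.14, 19.23, 20.0, 19.10],
})

# (vehicle, launch time) tuples in launch order, skipping vehicles that never launch.
launches = [(v, launch_time(df, v)) for v in df["vehicle"].unique()]
launches = sorted((p for p in launches if p[1] is not None), key=lambda p: p[1])
print(launches)
```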
Apologies for this basic question. I am new to Python and am having a problem with my code. I used pandas to load a .csv file and am having trouble accessing particular elements.
import pandas as pd
dateYTM = pd.read_csv('Date.csv')
print(dateYTM)
## Result
# Date
# 0 20030131
# 1 20030228
# 2 20030331
# 3 20030430
# 4 20030530
#
# Process finished with exit code 0
How can I access, say, the first date? I tried many different ways but wasn't able to get what I want. Many thanks.
You can use read_csv with the parse_dates parameter and then select values with loc; see Selection By Label:
import pandas as pd
import io
temp=u"""Date,no
20030131,1
20030228,3
20030331,5
20030430,6
20030530,3
"""
# After testing, replace io.StringIO(temp) with the filename
dateYTM = pd.read_csv(io.StringIO(temp), parse_dates=['Date'])
print(dateYTM)
Date no
0 2003-01-31 1
1 2003-02-28 3
2 2003-03-31 5
3 2003-04-30 6
4 2003-05-30 3
# df.loc[index, column]
print(dateYTM.loc[0, 'Date'])
2003-01-31 00:00:00
print(dateYTM.loc[0, 'no'])
1
But if you need only a single value, it is better to use at; see Fast scalar value getting and setting:
# df.at[index, column]
print(dateYTM.at[0, 'Date'])
2003-01-31 00:00:00
print(dateYTM.at[0, 'no'])
1
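Both loc and at select by index label. If the goal is simply the first row regardless of its label, positional access with iloc is another option. The sketch below uses an inline CSV as a stand-in for Date.csv and converts the raw integer to a timestamp afterwards.

```python
import io
import pandas as pd

# Inline stand-in for Date.csv, read without parse_dates (values stay integers).
csv_text = "Date\n20030131\n20030228\n20030331\n"
dateYTM = pd.read_csv(io.StringIO(csv_text))

first_by_position = dateYTM["Date"].iloc[0]  # first row, whatever its label is
first_by_label = dateYTM.at[0, "Date"]       # the row whose index label is 0

# Turn the YYYYMMDD integer into a proper timestamp.
first_date = pd.to_datetime(str(first_by_position), format="%Y%m%d")
print(first_by_position, first_by_label, first_date)
```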
I have a 100M line csv file (actually many separate csv files) totaling 84GB. I need to convert it to a HDF5 file with a single float dataset. I used h5py in testing without any problems, but now I can't do the final dataset without running out of memory.
How can I write to HDF5 without having to store the whole dataset in memory? I'm expecting actual code here, because it should be quite simple.
I was just looking into pytables, but it doesn't look like the array class (which corresponds to a HDF5 dataset) can be written to iteratively. Similarly, pandas has read_csv and to_hdf methods in its io_tools, but I can't load the whole dataset at one time so that won't work. Perhaps you can help me solve the problem correctly with other tools in pytables or pandas.
Use append=True in the call to to_hdf:
import numpy as np
import pandas as pd
filename = '/tmp/test.h5'
df = pd.DataFrame(np.arange(10).reshape((5,2)), columns=['A', 'B'])
print(df)
# A B
# 0 0 1
# 1 2 3
# 2 4 5
# 3 6 7
# 4 8 9
# Save to HDF5
df.to_hdf(filename, key='data', mode='w', format='table')
del df # allow df to be garbage collected
# Append more data
df2 = pd.DataFrame(np.arange(10).reshape((5,2))*10, columns=['A', 'B'])
df2.to_hdf(filename, key='data', append=True)
print(pd.read_hdf(filename, key='data'))
yields
A B
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
0 0 10
1 20 30
2 40 50
3 60 70
4 80 90
Note that you need to use format='table' in the first call to df.to_hdf to make the table appendable. Otherwise the format defaults to 'fixed', which is faster for reading and writing but creates a table that cannot be appended to.
Thus, you can process the CSV files one at a time, using append=True to build the HDF5 file. Then overwrite the DataFrame or use del df so that the old DataFrame can be garbage collected.
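Put together, the per-file loop might look like the sketch below. The function name csvs_to_hdf, the glob pattern argument, and the assumption that every column is a float are illustrative choices, not part of the original answer. It also reads each CSV in chunks so that no single file has to fit in memory, and it needs the PyTables package, which pandas uses for HDF5.

```python
import glob
import pandas as pd

def csvs_to_hdf(pattern, h5_path, key="data", chunksize=1_000_000):
    """Append every CSV matching pattern to one HDF5 table, chunk by chunk."""
    first = True
    for csv_path in sorted(glob.glob(pattern)):
        # Assumption: all columns are floats, as in the question's dataset.
        for chunk in pd.read_csv(csv_path, dtype=float, chunksize=chunksize):
            chunk.to_hdf(
                h5_path,
                key=key,
                mode="w" if first else "a",  # start a fresh file only once
                format="table",              # 'table' format allows appending
                append=not first,
            )
            first = False
    return h5_path
```

A call such as csvs_to_hdf('parts/*.csv', 'all.h5') would then hold only one chunk in memory at a time.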
Alternatively, instead of calling df.to_hdf, you could append to a HDFStore:
import numpy as np
import pandas as pd
filename = '/tmp/test.h5'
store = pd.HDFStore(filename)
for i in range(2):
    df = pd.DataFrame(np.arange(10).reshape((5,2)) * 10**i, columns=['A', 'B'])
    store.append('data', df)
store.close()
store = pd.HDFStore(filename)
data = store['data']
print(data)
store.close()
yields
A B
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
0 0 10
1 20 30
2 40 50
3 60 70
4 80 90
This should be possible with PyTables. You'll need to use the EArray class though.
As an example, the following is a script I wrote to import chunked training data stored as .npy files into a single .h5 file.
import numpy
import tables
import os
training_data = tables.open_file('nn_training.h5', mode='w')
a = tables.Float64Atom()
bl_filter = tables.Filters(5, 'blosc')  # fast compressor at a moderate setting
training_input = training_data.create_earray(training_data.root, 'X', a,
                                             (0, 1323), 'Training Input',
                                             bl_filter, 4000000)
training_output = training_data.create_earray(training_data.root, 'Y', a,
                                              (0, 27), 'Training Output',
                                              bl_filter, 4000000)
for filename in os.listdir('input'):
    print("loading {}...".format(filename))
    chunk = numpy.load(os.path.join('input', filename))
    print("writing to h5")
    training_input.append(chunk)
for filename in os.listdir('output'):
    print("loading {}...".format(filename))
    training_output.append(numpy.load(os.path.join('output', filename)))
training_data.close()  # flush buffered data and close the file
Take a look at the docs for detailed instructions, but very briefly, the create_earray function takes 1) a data root or parent node; 2) an array name; 3) a datatype atom; 4) a shape with a 0 in the dimension you want to expand; 5) a verbose descriptor; 6) a compression filter; and 7) an expected number of rows along the expandable dimension. Only the first two have no default value, but you also need to supply an atom and a shape (or an existing array through the obj argument), and you'll probably use all seven in practice. The function accepts a few other optional arguments as well; again, see the docs for details.
Once the array is created, you can use its append method in the expected way.
If you have a very large single CSV file, you may want to stream the conversion to HDF5 in chunks, e.g.:
import numpy as np
import pandas as pd
from IPython.display import clear_output
CHUNK_SIZE = 5000000
filename = 'data.csv'
dtypes = {'latitude': float, 'longitude': float}
iter_csv = pd.read_csv(
    filename, iterator=True,
    dtype=dtypes, encoding='utf-8', chunksize=CHUNK_SIZE)
cnt = 0
for chunk in iter_csv:
    chunk.to_hdf(
        "data.hdf", key='data', format='table', append=True)
    cnt += len(chunk)  # count the rows actually read; the last chunk may be short
    clear_output(wait=True)
    print(f"Processed {cnt:,.0f} coordinates..")
Tested with a 64 GB CSV file containing 450 million coordinates (the conversion took about 10 minutes).