I am trying to create a numpy array from a Geosoft-formatted XYZ text file. The format uses a '/' to start the header line, with space-delimited column names after it. I believe numpy sees the '/' and assigns it as column zero.
The header line looks like this:
/ Line Aircraft Flight Date JulL Time DateU TimeU Zn Easting Northing Lat Long xTrack ZFid_ms KFid AFid TFRAWT TFUNCT Mag4D VecX VecY VecZ VecTF MagRatio GPSHt Undul Sats HDop DGPS RadAlt BaroHPa Temp Humid CurrAmps Dn Up Samp Live RawTC RawK RawU RawTh Cosm LRange LStr LErr
data = np.genfromtxt(filename, deletechars="/", usecols=(0, 18, 20, 21, 22, 23), invalid_raise=False, names=True, dtype=None)
data.dtype.names gives the result below; the first column did not get a valid column name:
('f0', 'TFRAWT', 'Mag4D', 'VecX', 'VecY', 'VecZ')
I thought deletechars was built for this purpose? Am I not using it correctly?
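A workaround I am considering (a sketch only; I am guessing the intended columns are Line, TFRAWT, Mag4D, VecX, VecY and VecZ from the dtype output above, and the file name is a placeholder): read the header line yourself, strip the leading '/', and pass the cleaned names to genfromtxt explicitly, so usecols can refer to columns by name:
import numpy as np

filename = 'survey.xyz'  # placeholder path

# Read the header line manually and drop the leading '/' marker,
# so it never becomes a (nameless) first column.
with open(filename) as f:
    header = f.readline().lstrip('/').split()

data = np.genfromtxt(filename,
                     skip_header=1,   # the header line was consumed above
                     names=header,    # explicit, clean field names
                     usecols=('Line', 'TFRAWT', 'Mag4D', 'VecX', 'VecY', 'VecZ'),
                     dtype=None,
                     invalid_raise=False)
print(data.dtype.names)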
I have an xlsx file, where each row corresponds to a sample with associated features in each column, as shown here:
xlsx file example
I am trying to convert this xlsx file into a dat file, with multiple spaces separating the columns, as displayed in the example below:
samples property feature1 feature2 feature3
sample1 3.0862 0.8626 0.7043 0.6312
sample2 2.8854 0.7260 0.7818 0.6119
sample3 0.6907 0.4943 0.0044 0.4420
sample4 0.9902 0.0106 0.0399 0.9877
sample5 0.7242 0.0970 0.3199 0.5504
I have tried doing this by creating a dataframe in pandas and using dataframe.to_csv to save the file as a .dat, but it only allows me to use one character as a delimiter. Does anyone know how I might go about creating a file like the one above?
You can use the string representation (to_string) of the DataFrame that pandas reads from the Excel file:
import pandas as pd

df = pd.read_excel('input.xlsx')
with open('output.dat', 'w') as f:
    f.write(df.to_string(index=False))
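If you need more control over the column widths or decimal places, to_string also accepts col_space and float_format, for example:
f.write(df.to_string(index=False, col_space=10, float_format='{:.4f}'.format))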
Here is another approach that does not use a DataFrame. It gives more flexibility, since we build the whole structure ourselves from the ground up.
Suppose you have read the xlsx file and stored it as a 2-d list, as follows:
lines = [['sample1', 3.0862, 0.8626, 0.7043, 0.6312],
['sample2', 2.8854, 0.7260, 0.7818, 0.6119],
['sample3', 0.6907, 0.4943, 0.0044, 0.4420],
['sample4', 0.9902, 0.0106, 0.0399, 0.9877],
['sample5', 0.7242, 0.0970, 0.3199, 0.5504]]
We can make use of string methods like ljust, rjust, or center. Here I show the use of ljust, which takes the total field width as its first argument and left-justifies the string within that width.
One could also use an f-string to do the padding, in the form f'{var:^10.4f}'. The meaning of each component is:
^ represents centering (it can be changed to < for left justification or > for right justification)
10 is the total field width
.4 is the number of decimal places
f means fixed-point (float) formatting
So, here is the final script.
padding1 = 12
padding2 = 10
print('samples'.ljust(padding1 + 1) + 'property ' + 'feature1 ' + 'feature2 ' + 'feature3')
for line in lines:
    text = line[0].ljust(padding1)
    for i in range(1, len(line)):
        text += f'{line[i]:^{padding2}.4f}'
    print(text)
I'm having trouble parsing a txt file (see here: File)
Here's my code
import pandas as pd
objectname = r"path"
df = pd.read_csv(objectname, engine = 'python', sep='\t', header=None)
Unfortunately it does not work. Since this question has been asked several times, I tried lots of proposed solutions (most of them can be found here: Possible solutions)
However, nothing did the trick for me. For instance, when I use
sep='delimiter'
The dataframe is created but everything ends up in a single column.
When I use
error_bad_lines=False
The rows I'm interested in are simply skipped.
The only way it works is when I first open the txt file, copy the content, paste it into google sheets, save the file as CSV and then open the dataframe.
I guess another workaround would be to use
df = pd.read_csv(objectname, engine = 'python', sep = 'delimiter', header=None)
in combination with the split function Split function
Is there any suggestion how to make this work without the need to convert the file or to use the split function? I'm using Python 3 and Windows 10.
Any help is appreciated.
Your file has tab separators but is not a TSV. The file is a mixture of metadata, followed by a "standard" TSV, followed by more metadata. Therefore, I found tackling the metadata as a separate task from loading the data to be useful.
Here's what I did to extract the metadata lines:
with open('example.txt', 'r') as file_handle:
    file_content = file_handle.read().split('\n')

for index, line in enumerate(file_content):
    if index < 21 or index > 37:
        print(index, line.split('\t'))
Note that the lines denoting the start and stop of metadata (21 and 37 in my example) are specific to the file. I've provided the trimmed data I used below (based on your linked file).
Separately, I loaded the TSV into Pandas using
import pandas as pd

df = pd.read_csv('example.txt', engine='python',
                 sep='\t', error_bad_lines=False, header=None,
                 skiprows=list(range(21)) + list(range(37, 89)))
Again, I skipped the metadata at the start of the file and at the end of the file.
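Two side notes on that call (hedged, since I did not re-run them against the full file): on newer pandas versions error_bad_lines=False has been replaced by on_bad_lines='skip', and the numeric values in this file use a comma as the decimal separator, so decimal=',' is probably needed to get floats rather than strings:
import pandas as pd

# Same skiprows idea as above; decimal=',' handles values such as "-0,789606",
# and on_bad_lines='skip' is the newer spelling of error_bad_lines=False.
df = pd.read_csv('example.txt', engine='python',
                 sep='\t', on_bad_lines='skip', header=None,
                 decimal=',',
                 skiprows=list(range(21)) + list(range(37, 89)))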
Here's the file I experimented with. I've trimmed the extra data to reduce line count.
TITLE Test123
DATA TYPE
ORIGIN JASCO
OWNER
DATE 19/03/28
TIME 16:39:44
SPECTROMETER/DATA SYSTEM
LOCALE 1031
RESOLUTION
DELTAX -0,5
XUNITS NANOMETERS
YUNITS CD [mdeg]
Y2UNITS HT [V]
Y3UNITS ABSORBANCE
FIRSTX 300,0000
LASTX 190,0000
NPOINTS 221
FIRSTY -0,78961
MAXY 37,26262
MINY -53,38971
XYDATA
300,0000 -0,789606 182,198 -0,0205245
299,5000 -0,691644 182,461 -0,0181217
299,0000 -0,700976 182,801 -0,0136756
298,5000 -0,614708 182,799 -0,0131957
298,0000 -0,422611 182,783 -0,0130073
195,0000 26,6231 997,498 4,7258
194,5000 -17,3049 997,574 4,6864
194,0000 16,0387 997,765 4,63967
193,5000 -14,4049 997,967 4,58593
193,0000 -0,277261 998,025 4,52411
192,5000 -29,6098 998,047 4,45244
192,0000 -11,5786 998,097 4,36608
191,5000 34,0505 998,282 4,27376
191,0000 28,2325 998,314 4,1701
190,5000 -13,232 998,336 4,05036
190,0000 -47,023 998,419 3,91883
##### Extended Information
[Comments]
Sample name X
Comment
User
Division
Company RWTH Aachen
[Detailed Information]
Creation date 28.03.2019 16:39
Data array type Linear data array * 3
Horizontal axis Wavelength [nm]
Vertical axis(1) CD [mdeg]
Vertical axis(2) HT [V]
Vertical axis(3) Abs
Start 300 nm
End 190 nm
Data interval 0,5 nm
Data points 221
[Measurement Information]
Instrument name CD-Photometer
Model name J-1100
Serial No. A001361635
Detector Standard PMT
Lock-in amp. X mode
HT volt Auto
Accessory PTC-514
Accessory S/N A000161648
Temperature 18.63 C
Control sonsor Holder
Monitor sensor Holder
Measurement date 28.03.2019 16:39
Overload detect 203
Photometric mode CD, HT, Abs
Measure range 300 - 190 nm
Data pitch 0.5 nm
CD scale 2000 mdeg/1.0 dOD
FL scale 200 mdeg/1.0 dOD
D.I.T. 0.5 sec
Bandwidth 1.00 nm
Start mode Immediately
Scanning speed 200 nm/min
Baseline correction Baseline
Shutter control Auto
Accumulations 3
I am trying to extract from multi-temporal files (netCDF) the value of the pixel at a specific location and time.
Each file is named: T2011, T2012, and so on until T2017.
Each file contains 365 layers; each layer corresponds to one day of the year and gives the temperature for that day.
My goal is to extract information according to my input dataset.
I have a csv (locd.csv) with my targets and it looks like this:
id lat lon DateFin DateCount
1 46.63174271 7.405986324 02-02-18 43,131
2 46.64972969 7.484352537 25-01-18 43,123
3 47.27028727 7.603811832 20-01-18 43,118
4 46.99994455 7.063905466 05-02-18 43,134
5 47.08125481 7.19501811 20-01-18 43,118
6 47.37833814 7.432005368 11-12-18 43,443
7 47.43230354 7.445253182 30-12-18 43,462
8 46.73777711 6.777871255 09-04-18 43,197
69 47.42285191 7.113934735 09-04-18 43,197
The id is the location I am interested in, lat and lon are the latitude and longitude, DateFin is the date for which I want the temperature at that particular location, and DateCount is the number of days from 01-01-1900 to that date (that's how the layers are indexed in the files).
To do that I have something like this:
import glob
from netCDF4 import Dataset
import pandas as pd
import numpy as np
from datetime import date
import os
# Record all the years of the netCDF files into a Python list
all_years = []
for file in glob.glob('*.nc'):
    print(file)
    data = Dataset(file, 'r')
    time = data.variables['time']  # that's how the days are stored
    year = file[0:4]
    all_years.append(year)

# define my input data
cities = pd.read_csv('locd.csv')

# extracting the data
for index, row in cities.iterrows():
    id_row = row['id']  # id from the database
    location_latitude = row['lat']
    location_longitude = row['lon']
    location_date = row['DateCount']  # number of day counting since 1900-01-01

    # Sorting the all_years python list
    all_years.sort()

    for yr in all_years:
        # Reading-in the data
        data = Dataset(str(yr) + '.nc', 'r')

        # Storing the lat and lon data of the netCDF file into variables
        lat = data.variables['lat'][:]
        lon = data.variables['lon'][:]

        # Squared difference between the specified lat,lon and the lat,lon of the netCDF
        sq_diff_lat = (lat - location_latitude)**2
        sq_diff_lon = (lon - location_longitude)**2

        # Identify the index of the min value for lat and lon
        min_index_lat = sq_diff_lat.argmin()
        min_index_lon = sq_diff_lon.argmin()

        # Accessing the precipitation data
        prec = data.variables['precipi']  # that's how the variable is called

        for p_index in np.arange(0, len(location_date)):
            print('Recording the value for ' + id_row + ': ' + str(location_date[p_index]))
            df.loc[id_row[location_date]]['Precipitation'] = prec[location_date, min_index_lat, min_index_lon]

# to record it in a new archive
df.to_csv('locationNew.csv')
My issues:
I can't get it to work. Every time I fix one thing, a new error appears; right now it says that "id_row" must be a string.
Does anybody have a hint, or experience working with this type of file?
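Not a full answer, but a minimal sketch of the single-point lookup described above, under the assumptions flagged in the comments (the T<year>.nc file naming, the lat/lon/time/precipi variable names from the code, and a time axis in days since 1900-01-01):
import numpy as np
import pandas as pd
from netCDF4 import Dataset

cities = pd.read_csv('locd.csv')
results = []

for _, row in cities.iterrows():
    # DateCount may come in as text like "43,131"; normalise it to an int.
    target_day = int(str(row['DateCount']).replace(',', ''))

    # Assumption: one file per year, named like T2011.nc; pick the file that covers target_day.
    with Dataset('T2011.nc', 'r') as nc:
        lat = nc.variables['lat'][:]
        lon = nc.variables['lon'][:]
        time = nc.variables['time'][:]   # days since 1900-01-01, per the question

        i_lat = np.abs(lat - row['lat']).argmin()    # nearest grid cell
        i_lon = np.abs(lon - row['lon']).argmin()
        i_day = np.abs(time - target_day).argmin()   # nearest day index within this file

        value = nc.variables['precipi'][i_day, i_lat, i_lon]
        results.append({'id': row['id'], 'value': float(value)})

pd.DataFrame(results).to_csv('locationNew.csv', index=False)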
I've been going in circles for days now, and I've run out of steam. Doesn't help that I'm new to python / numpy / pandas etc.
I started with numpy which led me to pandas, because of a GIS function that delivers a numpy array of data. That is my starting point. I'm trying to get to an endpoint being a small enriched dataset, in an excel spreadsheet.
But it seems like going down a rabbit hole trying to extract that data and then manipulate it with the numpy toolsets. The delivered data is one dimensional, but each row contains 8 fields. A simple conversion to pandas and then to ndarray magically makes it all good. Except that I lose the headers in the process, and it just snowballs from there.
I've had to re-evaluate my understanding, based on some feedback on another post, and that's fine. But I'm just going in circles. Example after example seems to use predominantly numerical data, and I'm starting to get the feeling that's where its strength lies. Trying to use it for what I'd call a more non-mathematical / non-numerical purpose looks like I'm barking up the wrong tree.
Any advice?
Addendum
The data I extract from the GIS system is names, dates, other textual data. I then have another csv file that I need to use as a lookup, so that I can enrich the source with more textual information which finally gets published to excel.
SAMPLE DATA - SOURCE
WorkCode Status WorkName StartDate EndDate siteType Supplier
0 AT-W34319 None Second building 2020-05-04 2020-05-31 Type A Acem 1
1 AT-W67713 None Left of the red office tower 2019-02-11 2020-08-28 Type B Quester Q
2 AT-W68713 None 12 main street 2019-05-23 2020-11-03 Class 1 Type B Dettlim Group
3 AT-W70105 None city central 2019-03-07 2021-08-06 Other Hans Int
4 AT-W73855 None top floor 2019-05-06 2020-10-28 Type a None
SAMPLE DATA - CSV
["Id", "Version","Utility/Principal","Principal Contractor Contact"]
XM-N33463,7.1,"A Contracting company", "555-12345"
XM-N33211,2.1,"Contractor #b", "555-12345"
XM-N33225,1.3,"That other contractor", "555-12345"
XM-N58755,1.0,"v Contracting", "555-12345"
XM-N58755,2.3,"dsContracting", "555-12345"
XM-222222,2.3,"dsContracting", "555-12345"
BM-O33343,2.1,"dsContracting", "555-12345"
def SMAN():
    ####################################################################################################################
    # Exporting the results of the analysis...
    ####################################################################################################################
    """
    Approach is as follows:
    1) Get the source data
    2) Get the CSV lookup data loaded into memory - it'll be faster
    3) Iterate through the source data, looking for matches in the CSV data
    4) Add an extra couple of columns onto the source data, and populate it with the (matching) lookup data.
    5) Export the now enhanced data to excel.
    """
    arcpy.env.workspace = workspace + filenameGDB
    input = "ApprovedActivityByLocalBoard"
    exportFile = arcpy.da.FeatureClassToNumPyArray(input, ['WorkCode', 'Status', 'WorkName', 'PSN2', 'StartDate', 'EndDate', 'siteType', 'Supplier'])
    # we have our data, but it's (9893,) instead of [9893 rows x 8 columns]
    pdExportFile = pandas.DataFrame(exportFile)
    LBW = pdExportFile.to_numpy()
    del exportFile
    del pdExportFile
    # Now we have [9893 rows x 8 columns] - but we've lost the headers

    col_list = ["WorkCode", "Version", "Principal", "Contact"]
    allPermits = pandas.read_csv("lookup.csv", usecols=col_list)
    # Now we have the CSV file loaded ... and only the important parts - should be fast.
    # Shape: (94523, 4)
    # will have to find a way to improve this...
    # CSV file has got more than WorkCode, because there are different versions (as different records)
    # Only want the last one.

    # each record must now be "enhanced" with the matching record from the CSV file.
    finalReport = []  # we are expecting this to be [9893 rows x 12 columns] at the end
    counter = -1
    for eachWorksite in LBW[:5]:  # let's just work with 5 records right now...
        counter += 1
        # eachWorksite = list(eachWorksite)  # eachWorksite is a tuple - so need to convert it
        # # but if we change it to a list, we lose the headers!
        certID = LBW[counter][0]  # get the ID to use for lookup matching

        # Search the CSV data
        permitsFound = allPermits[allPermits['Id'] == certID]
        permitsFound = permitsFound.to_numpy()
        if numpy.shape(permitsFound)[0] > 1:
            print("Too many hits!")  # got to deal with that CSV Version field.
            exit()
        else:
            # now "enrich" the record/row by adding on the fields from the lookup
            # so a row goes from 8 fields to 12 fields
            newline = numpy.append(eachWorksite, permitsFound)
            # and this enhanced record/row must become the new normal
            # but I cannot change the original, so it must go into a new container
            finalReport = numpy.append(finalReport, newline, axis=0)

    # now I should have a new container, of "enriched" data
    # which has gone from [9893 rows x 8 columns] to [9893 rows x 12 columns]
    # Some of the columns, of course, could be empty.

    # Now let's dump the results to an Excel file and make it accessible for everyone else.
    df = pandas.DataFrame(finalReport)
    filepath = 'finalreport.csv'
    df.to_csv(filepath, index=False)
    # Somewhere I was getting Error("Cannot convert {0!r} to Excel".format(value))
    # Now I get
    filepath = 'finalReport.xlsx'
    df.to_excel(filepath, index=False)
I have eventually answered my own question, and this is how:
Yes, for my situation, pandas worked just fine, even beautifully, for manipulating non-numerical data. I just had to learn some basics.
The biggest learning was to understand the pandas DataFrame as an object that has to be manipulated remotely by various functions/tools. Just because I "print" the dataframe doesn't mean it's just text. (Thanks juanpa.arrivillaga for pointing out my erroneous assumptions in Why can I not reproduce a nd array manually?)
I also had to wrap my mind around the concept of indexes and columns, and how they can be altered/manipulated etc. And then, how to use them to maximum effect.
Once those fundamentals had been sorted, the rest followed naturally, and my code reduced to a couple of nice elegant functions.
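For anyone landing here with the same problem, a rough sketch of the kind of merge-based approach this boils down to (column names taken from the samples above; keeping only the newest Version per Id via drop_duplicates is my assumption about the intent):
import pandas as pd

# exportFile is the structured array returned by FeatureClassToNumPyArray above;
# pandas keeps the field names as column headers.
source = pd.DataFrame(exportFile)

# Lookup CSV: keep only the newest Version per Id.
lookup = (pd.read_csv('lookup.csv')
            .sort_values('Version')
            .drop_duplicates('Id', keep='last'))

# Enrich each worksite with its matching lookup record and publish to Excel.
enriched = source.merge(lookup, left_on='WorkCode', right_on='Id', how='left')
enriched.to_excel('finalReport.xlsx', index=False)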
Cheers
I've got a list of cities with associated lon,lat values that I'd like to turn into a DataFrame, but instead of reading from a CSV file, I want to have the user modify or add to these city,lat,lon values into a cell in an IPython notebook. Right now I have this solution that works, but it seems a bit ugly:
import numpy as np
import pandas as pd

sta = np.array([
    ('Boston', 42.368186, -71.047984),
    ('Provincetown', 42.042745, -70.171180),
    ('Sandwich', 41.767990, -70.466219),
    ('Gloucester', 42.610253, -70.660570)
    ],
    dtype=[('City', '|S20'), ('Lat', '<f4'), ('Lon', '<f4')])
# Create a Pandas DataFrame
obs = pd.DataFrame.from_records(sta,index='City')
print(obs)
Lat Lon
City
Boston 42.368187 -71.047981
Provincetown 42.042744 -70.171181
Sandwich 41.767990 -70.466217
Gloucester 42.610252 -70.660568
Is there a clearer, safer way to create the DataFrame?
I'm thinking that folks will forget the parenthesis, add a closing ',' on the last line, etc.
Thanks,
Rich
You could just create a big multiline string that they edit, then use read_csv to read it from a StringIO object:
x = """
City, Lat, Long
Boston, 42.4, -71.05
Provincetown, 42.04, -70.12
"""
>>> pandas.read_csv(StringIO.StringIO(x.strip()), sep=",\s*")
City Lat Long
0 Boston 42.40 -71.05
1 Provincetown 42.04 -70.12
Of course, people can still make errors with this (e.g., inserting commas), but the format is simpler.
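For what it's worth, on Python 3 the same idea might look like this (StringIO now lives in the io module, and skipinitialspace=True stands in for the regex separator):
import pandas as pd
from io import StringIO

x = """
City, Lat, Long
Boston, 42.4, -71.05
Provincetown, 42.04, -70.12
"""

df = pd.read_csv(StringIO(x.strip()), sep=',', skipinitialspace=True)
print(df)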