I'm having trouble parsing a txt file (see here: File)
Here's my code
import pandas as pd
objectname = r"path"
df = pd.read_csv(objectname, engine = 'python', sep='\t', header=None)
Unfortunately it does not work. Since this question has been asked several times, I tried lots of proposed solutions (most of them can be found here: Possible solutions)
However, nothing did the trick for me. For instance, when I use
sep='delimiter'
The dataframe is created but everything ends up in a single column.
When I use
error_bad_lines=False
The rows I'm interested in are simply skipped.
The only way it works is when I first open the txt file, copy the content, paste it into google sheets, save the file as CSV and then open the dataframe.
I guess another workaround would be to use
df = pd.read_csv(objectname, engine = 'python', sep = 'delimiter', header=None)
in combination with the split function (Split function)
Is there any suggestion on how to make this work without converting the file or using the split function? I'm using Python 3 and Windows 10.
Any help is appreciated.
Your file has tab separators but is not a TSV. The file is a mixture of metadata, followed by a "standard" TSV, followed by more metadata. Therefore, I found tackling the metadata as a separate task from loading the data to be useful.
Here's what I did to extract the metadata lines:
with open('example.txt', 'r') as file_handle:
    file_content = file_handle.read().split('\n')
    for index, line in enumerate(file_content):
        if index < 21 or index > 37:
            print(index, line.split('\t'))
Note that the lines denoting the start and stop of metadata (21 and 37 in my example) are specific to the file. I've provided the trimmed data I used below (based on your linked file).
Separately, I loaded the TSV into Pandas using
import pandas as pd
df = pd.read_csv('example.txt', engine='python',
                 sep='\t', error_bad_lines=False, header=None,
                 skiprows=list(range(21)) + list(range(37, 89)))
Again, I skipped the metadata at the start of the file and at the end of the file.
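If you'd rather not hard-code the start and stop lines, one option (just a sketch, relying on the 'XYDATA' and '#####' marker lines visible in the trimmed file below) is to locate the boundaries first and build skiprows from them:

import pandas as pd

with open('example.txt', 'r') as file_handle:
    lines = file_handle.read().split('\n')

# Data starts on the line after 'XYDATA' and stops at the '#####' line.
start = next(i for i, line in enumerate(lines) if line.startswith('XYDATA')) + 1
stop = next(i for i, line in enumerate(lines) if line.startswith('#####'))

df = pd.read_csv('example.txt', engine='python', sep='\t', header=None,
                 skiprows=list(range(start)) + list(range(stop, len(lines))))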
Here's the file I experimented with. I've trimmed the extra data to reduce line count.
TITLE Test123
DATA TYPE
ORIGIN JASCO
OWNER
DATE 19/03/28
TIME 16:39:44
SPECTROMETER/DATA SYSTEM
LOCALE 1031
RESOLUTION
DELTAX -0,5
XUNITS NANOMETERS
YUNITS CD [mdeg]
Y2UNITS HT [V]
Y3UNITS ABSORBANCE
FIRSTX 300,0000
LASTX 190,0000
NPOINTS 221
FIRSTY -0,78961
MAXY 37,26262
MINY -53,38971
XYDATA
300,0000 -0,789606 182,198 -0,0205245
299,5000 -0,691644 182,461 -0,0181217
299,0000 -0,700976 182,801 -0,0136756
298,5000 -0,614708 182,799 -0,0131957
298,0000 -0,422611 182,783 -0,0130073
195,0000 26,6231 997,498 4,7258
194,5000 -17,3049 997,574 4,6864
194,0000 16,0387 997,765 4,63967
193,5000 -14,4049 997,967 4,58593
193,0000 -0,277261 998,025 4,52411
192,5000 -29,6098 998,047 4,45244
192,0000 -11,5786 998,097 4,36608
191,5000 34,0505 998,282 4,27376
191,0000 28,2325 998,314 4,1701
190,5000 -13,232 998,336 4,05036
190,0000 -47,023 998,419 3,91883
##### Extended Information
[Comments]
Sample name X
Comment
User
Division
Company RWTH Aachen
[Detailed Information]
Creation date 28.03.2019 16:39
Data array type Linear data array * 3
Horizontal axis Wavelength [nm]
Vertical axis(1) CD [mdeg]
Vertical axis(2) HT [V]
Vertical axis(3) Abs
Start 300 nm
End 190 nm
Data interval 0,5 nm
Data points 221
[Measurement Information]
Instrument name CD-Photometer
Model name J-1100
Serial No. A001361635
Detector Standard PMT
Lock-in amp. X mode
HT volt Auto
Accessory PTC-514
Accessory S/N A000161648
Temperature 18.63 C
Control sonsor Holder
Monitor sensor Holder
Measurement date 28.03.2019 16:39
Overload detect 203
Photometric mode CD, HT, Abs
Measure range 300 - 190 nm
Data pitch 0.5 nm
CD scale 2000 mdeg/1.0 dOD
FL scale 200 mdeg/1.0 dOD
D.I.T. 0.5 sec
Bandwidth 1.00 nm
Start mode Immediately
Scanning speed 200 nm/min
Baseline correction Baseline
Shutter control Auto
Accumulations 3
I am working with a dataframe of 1006150 rows and 3 columns, where each row contains the abstract of a wikipedia resource:
>>> print(df)
individual abstract type
0 -ismist_Recordings "-ismist Recordings was founded in 1992 as -is... RecordLabel
1 –30–_(The_Wire) ""–30–" is the series finale of the HBO origin... TelevisionEpisode
2 !!! "!!! is a dance-punk band that formed in Sacra... Band
3 !!!_(album) "!!! is the eponymous debut studio album by ro... Album
4 !Arriba!_La_Pachanga "!Arriba! La Pachanga is an album by Mongo San... Album
The goal is to vectorize the abstract column in order to feed a text model.
The problem is that in R, when I try to get the list of abstracts to perform this conversion, I end up with a very large variable (around 800 MB), which makes me run out of memory when I run the vectorizer. I have tried quanteda's dfm() and TfIdfVectorizer from the superml package. With quanteda I got a dfm of about 2 GB (too large to train the model), and superml throws an out-of-memory error:
> df <- read.csv(file="t.csv", stringsAsFactors = F)
> abstracts_list <- df$abstract
> object.size(abstracts_list)
806675688 bytes
> library("superml")
Loading required package: R6
> tf <- TfIdfVectorizer$new()
> tf$fit_transform(abstracts_list)
Error: cannot allocate vector of size 7822.6 Gb
The curious thing is that in Python I don't have this problem: after loading the data and extracting the list of abstracts, the variable is only about 8 MB.
import pandas as pd
import sys
df = pd.read_csv("t.csv")
train_data = df['abstract'].tolist()
sys.getsizeof(train_data)
>>> sys.getsizeof(df['abstract'])
1093787258
>>> sys.getsizeof(train_data)
8049256
It seems that in Python, converting a column into a list with numpy.ndarray.tolist reduces the reported size considerably (R: 800 MB, Python: 8 MB). Is there a similar method in R that lets me convert a dataframe column into a smaller list?
Note: both the Python and R scripts use the same CSV file.
Note 2: unlike in R, in Python I can use TfidfVectorizer (sklearn.feature_extraction.text.TfidfVectorizer) without any problem.
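Note 3: for reference, sys.getsizeof on a list is shallow (it counts the list object and its element pointers, not the strings themselves), so a rough deep measurement can be made like this:

import sys
import pandas as pd

df = pd.read_csv("t.csv")
train_data = df['abstract'].tolist()

# Shallow size: the list object plus one pointer per element.
shallow = sys.getsizeof(train_data)

# Rough deep size: also count each string object the list points to.
deep = shallow + sum(sys.getsizeof(s) for s in train_data)

print(f"shallow: {shallow / 1e6:.1f} MB, deep: {deep / 1e6:.1f} MB")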
I've been going in circles for days now, and I've run out of steam. Doesn't help that I'm new to python / numpy / pandas etc.
I started with numpy which led me to pandas, because of a GIS function that delivers a numpy array of data. That is my starting point. I'm trying to get to an endpoint being a small enriched dataset, in an excel spreadsheet.
But it seems like going down a rabbit hole trying to extract that data and then manipulate it with the numpy toolsets. The delivered data is one dimensional, but each row contains 8 fields. A simple conversion to pandas and then to ndarray magically makes it all good, except that I lose the headers in the process, and it just snowballs from there.
I've had to re-evaluate my understanding based on some feedback on another post, and that's fine. But I'm just going in circles. Example after example seems to use predominantly numerical data, and I'm starting to get the feeling that's where its strength lies. Trying to use it for what I'd call a more non-mathematical/non-numerical purpose looks like I'm barking up the wrong tree.
Any advice?
Addendum
The data I extract from the GIS system consists of names, dates, and other textual data. I then have another CSV file that I need to use as a lookup, so that I can enrich the source with more textual information, which finally gets published to Excel.
SAMPLE DATA - SOURCE
WorkCode Status WorkName StartDate EndDate siteType Supplier
0 AT-W34319 None Second building 2020-05-04 2020-05-31 Type A Acem 1
1 AT-W67713 None Left of the red office tower 2019-02-11 2020-08-28 Type B Quester Q
2 AT-W68713 None 12 main street 2019-05-23 2020-11-03 Class 1 Type B Dettlim Group
3 AT-W70105 None city central 2019-03-07 2021-08-06 Other Hans Int
4 AT-W73855 None top floor 2019-05-06 2020-10-28 Type a None
SAMPLE DATA - CSV
["Id", "Version","Utility/Principal","Principal Contractor Contact"]
XM-N33463,7.1,"A Contracting company", "555-12345"
XM-N33211,2.1,"Contractor #b", "555-12345"
XM-N33225,1.3,"That other contractor", "555-12345"
XM-N58755,1.0,"v Contracting", "555-12345"
XM-N58755,2.3,"dsContracting", "555-12345"
XM-222222,2.3,"dsContracting", "555-12345"
BM-O33343,2.1,"dsContracting", "555-12345"
def SMAN():
    ####################################################################################################################
    # Exporting the results of the analysis...
    ####################################################################################################################
    """
    Approach is as follows:
    1) Get the source data
    2) Get the CSV lookup data loaded into memory - it'll be faster
    3) Iterate through the source data, looking for matches in the CSV data
    4) Add an extra couple of columns onto the source data, and populate them with the (matching) lookup data.
    5) Export the now enhanced data to excel.
    """
    arcpy.env.workspace = workspace + filenameGDB
    input = "ApprovedActivityByLocalBoard"
    exportFile = arcpy.da.FeatureClassToNumPyArray(input, ['WorkCode', 'Status', 'WorkName', 'PSN2', 'StartDate', 'EndDate', 'siteType', 'Supplier'])
    # we have our data, but it's (9893,) instead of [9893 rows x 8 columns]
    pdExportFile = pandas.DataFrame(exportFile)
    LBW = pdExportFile.to_numpy()
    del exportFile
    del pdExportFile
    # Now we have [9893 rows x 8 columns] - but we've lost the headers
    col_list = ["WorkCode", "Version", "Principal", "Contact"]
    allPermits = pandas.read_csv("lookup.csv", usecols=col_list)
    # Now we have the CSV file loaded ... and only the important parts - should be fast.
    # Shape: (94523, 4)
    # will have to find a way to improve this...
    # CSV file has got more than WorkCode, because there are different versions (as different records)
    # Only want the last one.
    # each record must now be "enhanced" with matching record from the CSV file.
    finalReport = []  # we are expecting this to be [9893 rows x 12 columns] at the end
    counter = -1
    for eachWorksite in LBW[:5]:  # let's just work with 5 records right now...
        counter += 1
        # eachWorksite = list(eachWorksite)  # eachWorksite is a tuple - so need to convert it
        # # but if we change it to a list, we lose the headers!
        certID = LBW[counter][0]  # get the ID to use for lookup matching
        # Search the CSV data
        permitsFound = allPermits[allPermits['Id'] == certID]
        permitsFound = permitsFound.to_numpy()
        if numpy.shape(permitsFound)[0] > 1:
            print("Too many hits!")  # got to deal with that CSV Version field.
            exit()
        else:
            # now "enrich" the record/row by adding on the fields from the lookup
            # so a row goes from 8 fields to 12 fields
            newline = numpy.append(eachWorksite, permitsFound)
            # and this enhanced record/row must become the new normal
            # but I cannot change the original, so it must go into a new container
            finalReport = numpy.append(finalReport, newline, axis=0)
    # now I should have a new container, of "enriched" data
    # which has gone from [9893 rows x 8 columns] to [9893 rows x 12 columns]
    # Some of the columns of course, could be empty.
    # Now let's dump the results to an Excel file and make it accessible for everyone else.
    df = pandas.DataFrame(finalReport)
    filepath = 'finalreport.csv'
    df.to_csv(filepath, index=False)
    # Somewhere I was getting Error("Cannot convert {0!r} to Excel".format(value))
    # Now I get
    filepath = 'finalReport.xlsx'
    df.to_excel(filepath, index=False)
I have eventually answered my own question, and this is how:
Yes, for my situation, pandas worked just fine, even beautifully, for manipulating non-numerical data. I just had to learn some basics.
The biggest learning was to understand the pandas DataFrame as an object that has to be manipulated remotely by various functions/tools. Just because I "print" the dataframe doesn't mean it's just text. (Thanks juanpa.arrivillaga for pointing out my erroneous assumptions in Why can I not reproduce a nd array manually?)
I also had to wrap my mind around the concept of indexes and columns, how they can be altered/manipulated, and then how to use them to maximum effect.
Once those fundamentals had been sorted, the rest followed naturally, and my code reduced to a couple of nice elegant functions.
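For anyone with a similar lookup-and-enrich task, this is roughly the shape it took (a sketch only, not my exact code: the column names come from the sample data above, and sorting by 'Version' before drop_duplicates is just one way to keep only the latest record per Id):

import pandas as pd

# Structured array from arcpy.da.FeatureClassToNumPyArray, as in the question
source = pd.DataFrame(exportFile)

# Lookup CSV; keep only the latest Version for each Id.
lookup = pd.read_csv("lookup.csv",
                     usecols=["Id", "Version", "Utility/Principal",
                              "Principal Contractor Contact"])
lookup = (lookup.sort_values("Version")
                .drop_duplicates("Id", keep="last"))

# Left join: every source row keeps its 8 columns and gains the lookup columns.
enriched = source.merge(lookup, how="left",
                        left_on="WorkCode", right_on="Id")

enriched.to_excel("finalReport.xlsx", index=False)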
Cheers
I am trying to import data from a .txt file that contains four tab-separated columns and is several thousand lines long. This is how the start of the document looks:
Data info
File name: D:\(path to file)
Start time: 6/26/2019 15:39:54.222
Number of channels: 3
Sample rate: 1E6
Store type: fast on trigger
Post time: 20
Global header information: from DEWESoft
Comments:
Events
Event Type Event Time Comment
1 storing started at 7.237599
2 storing stopped at 7.257599
Data1
Time Incidente Transmitida DI 6
s um/m um/m -
0 2.1690152 140.98599 1
1E-6 2.1690152 140.98599 1
2E-6 4.3380303 145.32402 1
3E-6 4.3380303 145.32402 1
4E-6 -2.1690152 145.32402 1
I have several of these files that I want to loop through and store in a cell/list so that each cell/list item contains the four columns. After that I just use that cell/list to plot the data in a loop.
I saw that pandas library was suitable, but I don't understand how to use it.
fileNames = (["Test1_0001.txt", "Test2_0000.txt", "Test3_0000.txt",
"Test4_0000.txt", "Test5_0000.txt", "Test6_0001.txt", "Test7_0000.txt",
"Test8_0000.txt", "Test9_0000.txt", "Test10_0000.txt", "RawblueMat_0000.txt"])
folderName = 'AuxeticsSHPB\\' #Source folder for all files above
# Loop through each source document
for i in range(0, len(fileNames)):
    print('File location: ' + folderName + fileNames[i])
    # Get data from source as arrays, cut out the first 20 lines
    temp = pd.read_csv(folderName + fileNames[i], sep='\t', lineterminator='\r',
                       skiprows=[19], error_bad_lines=False)
    # Store data in list/cell
    # data[i] = temp # sort it
This is something I tried that didn't work, and I don't really know how to proceed. I know there is some documentation on this problem, but I am new to this and need some help.
An error I get when trying the above:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 12, saw 4
So it was an easy fix: I just had to remove the brackets from skiprows=[19]. With skiprows=[19], pandas skips only row 19, whereas skiprows=19 skips the first 19 rows, which removes the whole metadata header.
The code now looks like this and works.
fileNames = ["Test1_0001.txt", "Test2_0000.txt", "Test3_0000.txt",
"Test4_0000.txt", "Test5_0000.txt", "Test6_0001.txt", "Test7_0000.txt",
"Test8_0000.txt", "Test9_0000.txt", "Test10_0000.txt", "RawblueMat_0000.txt"]
folderName = 'AuxeticsSHPB\\' #Source folder for all files above
# Preallocation
data = []
for i in range(0, len(fileNames)):
    temp = pd.read_csv(folderName + fileNames[i], sep='\t', lineterminator='\r',
                       skiprows=19)
    data.append(temp)
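Plotting each dataset from the list can then be done in a loop. This is only a sketch: it assumes the first column is time and the second is the signal of interest, since the exact column names that survive skiprows=19 depend on the file:

import matplotlib.pyplot as plt

for i, df in enumerate(data):
    # First column assumed to be time, second the measured signal.
    plt.plot(df.iloc[:, 0], df.iloc[:, 1], label=fileNames[i])

plt.xlabel('Time [s]')
plt.ylabel('Signal')
plt.legend()
plt.show()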
I hope this isn't a classic beginner question; however, I have read a lot and spent days trying to save my CSV data without success.
I have a function that takes an input parameter that I set manually. The function generates 3 columns that I save to a CSV file. When I run the function with other inputs and want to save the new data to the right of the previously computed columns, pandas instead appends the new 3 columns below the existing ones, repeating the headers.
I'm using the next code to save my data:
data.to_csv('/Users/Computer/Desktop/Examples anaconda/data_new.csv', sep=',',mode='a')
and the result is:
dot lake mock
1 42 11.914558
2 41 42.446977
3 40 89.188668
dot lake mock
1 42 226.266513
2 41 317.768887
dot lake mock
3 42 560.171830
4 41 555.005333
What I want is:
dot lake mock mock mock
0 42 11.914558 226.266513 560.171830
1 41 42.446977 317.768887 555.005533
2 40 89.188668
UPDATE:
My DataFrame was generated using a function like this:
First I opened a csv file:
df1=pd.read_csv('current_state.csv')
def my_function(df1, photos, coords=['X', 'Y']):
    Hzs = t.copy()
    shifts = np.floor(Hzs / t_step).astype(np.int)
    ms = np.zeros(shifts.size)
    delta_inv = np.arange(N+1)
    dot = delta_inv[N:0:-1]
    lake = np.arange(1, N+1)
    for i, shift in enumerate(shifts):
        diffs = df1[coords] - df1[coords].shift(-shift)
        sqdist = np.square(diffs).sum(axis=1)
        ms[i] = sqdist.sum()
    mock = np.divide(ms, dot)
    msds = pd.DataFrame({'dot': dot, 'lake': lake, 'mock': mock})
    return msds
data = my_function(df1, photos, coords=['X', 'Y'])
print(data)
data.to_csv('/Users/Computer/Desktop/Examples anaconda/data_new.csv', sep=',', mode='a')
I looked for several days for a way to write a CSV file with several computed columns placed next to each other, and despite some unpleasant comments I finally found how to do it. In case someone needs something similar:
First I save my data using to_csv:
data.to_csv('/Users/Computer/Desktop/Examples/data_new.csv', sep=',',mode='a', index=False)
Once the file has been generated with the headers, I drop the index I don't need and, on subsequent calls of the function, just run this at the end:
b = data
a = pd.read_csv('data_new.csv')
c = pd.concat ([a,b],axis=1, ignore_index=True)
c.to_csv('/Users/Computer/Desktop/Examples/data_new.csv', sep=',', index=False)
As a result I get the desired CSV file, and the function can be called as many times as you want!
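Wrapped up as a small helper, the same read-concat-write pattern looks roughly like this (a sketch only, with a hypothetical relative path 'data_new.csv'):

import os
import pandas as pd

def append_columns(new_data, path='data_new.csv'):
    """Write new_data's columns to the right of any columns already in path."""
    if os.path.exists(path):
        existing = pd.read_csv(path)
        combined = pd.concat([existing, new_data.reset_index(drop=True)], axis=1)
    else:
        combined = new_data
    combined.to_csv(path, index=False)

# Example: call once per input parameter
append_columns(my_function(df1, photos, coords=['X', 'Y']))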
I need to access some grib files. I have figured out how to do it, using pygrib.
However, the only way I figured out how to do it is painstakingly slow.
I have 34 years of 3-hourly data, organized in ~36 files per year (one every 10 days, more or less), for a total of about 1000 files.
Each file has ~80 “messages” (8 values per day for 10 days). They are spatial data, so they have (x, y) dimensions.
To read all my data I write:
grbfile = pygrib.index(filename, 'shortName', 'typeOfLevel', 'level')
var1 = grbfile.select(typeOfLevel='pressureFromGroundLayer', level=180, shortName='unknown')
for it in np.arange(len(var1)):
    var_values, lat1, lon1 = var1[it].data()
    if it == 0:
        tot_var = np.expand_dims(var_values, axis=0)
    else:
        tot_var = np.append(tot_var, np.expand_dims(var_values, axis=0), axis=0)
and repeat this for each of the 1000 files.
Is there a quicker way, like loading all the ~80 layers per grib file at once? Something like:
var_values, lat1, lon1 = var1[:].data()
If I understand you correctly, you want the data from all 80 messages in each file stacked up in one array.
I have to warn you that that array will get very large and may cause NumPy to throw a MemoryError (it has happened to me before), depending on your grid size etc.
That being said, you can do something like this:
# substitute with a list of your file names
# glob is a standard-library module that can help accomplish this
files = list_of_files
grib = pygrib.open(files[0])  # start with the first one
# grib message numbering starts at 1
data, lats, lons = grib.message(1).data()
# while np.expand_dims works, the following is shorter
# syntax wise and will accomplish the same thing
data = data[None, ...]  # add an empty dimension as axis 0
for m in range(2, grib.messages + 1):
    data = np.vstack((data, grib.message(m).values[None, ...]))
grib.close()  # good practice
# now data has all the values from each message in the first file stacked up
# time to stack the rest on there
for file_ in files[1:]:  # all except the first file which we've done
    grib = pygrib.open(file_)
    for msg in grib:
        data = np.vstack((data, msg.values[None, ...]))
    grib.close()
print(data.shape)  # should be (80 * len(files), nlats, nlons)
This may gain you some speed. pygrib.open objects act like generators, so they pass you each pygrib.gribmessage object as it's called for instead of building a list of them like the select() method of a pygrib.index does. If you need all the messages in a particular file then this is the way I would access them.
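One more thought on speed: calling np.vstack inside the loop re-copies the growing array every time. A variation that should produce the same result (an untested sketch, using the same files list as above) collects the 2-D fields in a plain Python list and stacks them once at the end:

import numpy as np
import pygrib

fields = []
for file_ in files:  # same list of file names as above
    grib = pygrib.open(file_)
    for msg in grib:
        fields.append(msg.values)  # 2-D array for this message
    grib.close()

data = np.stack(fields, axis=0)  # shape: (total_messages, nlats, nlons)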
Hope it helps!