Currently, I'm trying to use astropy.io.ascii in Python (Anaconda) to write a .dat file containing data I've already read in (using ascii) from a different .dat file. I defined a specific table in the pre-existing file as Data. The problem with Data is that I need to multiply the first of the columns by a factor of 101325 to change its units, and I need the fourth of the four columns to disappear entirely. So I defined the first column as Pressure_pa and converted its units, then I defined the other two columns as Altitude_km and Temperature_K. Is there any way I can use ascii's write function to tell it to write a .dat file containing the three columns I defined, and how would I go about it? Below is the code that has brought me up to the point of having defined these three columns of data:
from astropy.io import ascii
Data=ascii.read('output_couple_121_100.dat',guess=False,header_start=384,data_start=385,data_end=485,delimiter=' ')
Pressure_pa=Data['P(atm)'][:]*101325
Altitude_km=Data['Alt(km)'][:]
Temperature_K=Data['T'][:]
Now I thought I might be able to use ascii.write() to write Pressure_pa, Altitude_km and Temperature_K into the same .dat file. Is there any way to do this?
So I think I figured it out! I'll post a more generic version to help others:
from astropy.io import ascii
Data=ascii.read('filename.dat',guess=False,header_start=1,data_start=2,data_end=10,delimiter=' ')
#above: defining Data as a section of a .dat file spanning lines 2 through 10, with headers in line 1
ascii.write(Data,'new_desired_file_name.dat',names=['col1','col2','col3','col4'],exclude_names=['col3'],delimiter=' ')
#above: telling ascii to take Data and create a .dat file from it; when defining the names, give a name to every column in Data, then use the exclude_names argument to tell it which columns to leave out
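For reference, here is a hedged sketch of an alternative approach that also applies the unit conversion in the output (the exclude_names approach above writes the original values unchanged): build a new Table from the converted/renamed columns and write only those three. The column names ('P(atm)', 'Alt(km)', 'T') and file names are taken from the question above.

from astropy.io import ascii
from astropy.table import Table

Data = ascii.read('output_couple_121_100.dat', guess=False, header_start=384,
                  data_start=385, data_end=485, delimiter=' ')

out = Table()
out['Pressure_pa'] = Data['P(atm)'] * 101325   # convert atm -> Pa
out['Altitude_km'] = Data['Alt(km)']
out['Temperature_K'] = Data['T']

ascii.write(out, 'new_file.dat', delimiter=' ', overwrite=True)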
I have an HTML form where the user selects their bank and uploads a CSV file of their transactions so I can handle the financial data:
I can store the file in a variable named 'file' but can't find a way to open it with traditional methods:
e.g. this doesn't work
I know the file is valid because I can open it with pandas in the Python code, though pandas messes up the column headings since there is some preamble data in the file.
Here is the file:
I am trying to do this so I can search for a row number by string. I need to know what row number 'Date' is on so I can pass that value to the skiprows argument of pd.read_csv in order to get a correct dataframe. This is what I came up with so far:
But obviously I cannot open the file in the first place. Ideally my output would be 7. I can't just use a static value of 7 for skiprows, as the amount of preamble data before the table changes from file to file.
This may not be an optimal answer, but maybe it will work for you:
file_content = file.stream.read().decode("UTF8")
lines = file_content.split('\n')
Then you can look for the line starting with Date to figure out your skiprows value.
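Putting it together, a minimal sketch (assuming file is the uploaded Werkzeug FileStorage object from the form, and that the header row starts with 'Date'):

import io
import pandas as pd

# 'file' is assumed to be the uploaded FileStorage object from the form.
file_content = file.stream.read().decode("utf-8")
lines = file_content.split('\n')

# Index of the first line that starts with 'Date' = number of preamble rows to skip.
skip = next(i for i, line in enumerate(lines) if line.startswith('Date'))

# Re-wrap the decoded text so pandas can parse it, skipping the preamble.
df = pd.read_csv(io.StringIO(file_content), skiprows=skip)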
I regularly get sent a csv containing 100+ columns and millions of rows. These csv files always contain a certain set of columns, Core_cols = [col_1, col_2, col_3], and a variable number of other columns, Var_col = [a, b, c, d, e]. The core columns are always there and there can be 0-200 of the variable columns. Sometimes one of the variable columns will contain a carriage return. I know which columns this can happen in: bad_cols = [a, b, c].
When I import the csv with pd.read_csv, these carriage returns create corrupt rows in the resultant dataframe. I can't re-make the csv without these columns.
How do I either:
Ignore these columns and the carriage return contained within? or
Replace the carriage returns with blanks in the csv?
My current code looks something like this:
df = pd.read_csv('data.csv', dtype=str)
I've tried things like removing the columns after the import, but the damage seems to have already been done by that point. I can't find the code now, but when testing one fix the error said something like "invalid character u000D in data". I don't control the source of the data so I can't make edits to it.
Pandas supports multiline CSV files if the file is properly escaped and quoted. If you cannot read a CSV file in Python using the pandas or csv modules, nor open it in MS Excel, then it's probably a non-compliant "CSV" file.
I recommend manually editing a sample of the CSV file until it works and can be opened with Excel, then recreating those steps programmatically in Python to normalize and process the large file.
Use this code to create a sample CSV file, copying the first ~100 lines into a new file:
with open('bigfile.csv', "r") as csvin, open('test.csv', "w") as csvout:
    line = csvin.readline()
    count = 0
    while line and count < 100:
        csvout.write(line)
        count += 1
        line = csvin.readline()
Now you have a small test file to work with. If the original CSV file has millions of rows and "bad" rows are found much later in the file then you need to add some logic to find the "bad" lines.
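If the stray carriage returns turn out to be bare \r characters inside fields (rather than properly quoted line breaks), one hedged approach is to normalize them to spaces before handing the file to pandas; the file names below are placeholders:

import re
import pandas as pd

# Read the raw text without newline translation so bare \r characters survive.
with open('data.csv', 'r', newline='', encoding='utf-8') as f:
    raw = f.read()

# Replace any \r that is not part of a \r\n line ending with a space.
cleaned = re.sub(r'\r(?!\n)', ' ', raw)

with open('data_clean.csv', 'w', newline='', encoding='utf-8') as f:
    f.write(cleaned)

df = pd.read_csv('data_clean.csv', dtype=str)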
My issue is as follows.
I've gathered some contact data from SurveyMonkey using the SM API, and I've converted that data into a txt file. When opening the txt file, I see the full data from the survey that I'm trying to convert into csv. However, when I use the following code:
df = pd.read_csv("my_file.txt",sep =",", encoding = "iso-8859-10")
df.to_csv('my_file.csv')
It creates a csv file with only two lines of values (and cuts off in the middle of the second line). Similarly, if I try to organize the data within a pandas dataframe, it only registers the first two lines, meaning most of my txt file is not being read.
As I've never run into this problem before and I've been able to convert into CSV without issues, I'm wondering if anyone here has ideas as to what might be causing this issue to occur and how I could go about solving it?
All help is much appreciated.
Edit:
I was able to get the data to display properly in csv when I converted it directly into csv from json, instead of converting it to a txt file first. However, I was not able to figure out what went wrong in the conversion from txt to csv, as I tried multiple different encodings but got the same result.
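For anyone hitting the same issue, a rough sketch of the JSON-to-CSV route that worked; the file name and the exact shape of the SurveyMonkey JSON are assumptions, and pd.json_normalize is used to flatten nested fields:

import json
import pandas as pd

# 'responses.json' is a placeholder for the raw JSON saved from the SM API.
with open('responses.json', 'r', encoding='utf-8') as f:
    records = json.load(f)

# Flatten the (possibly nested) records into a table and write it out.
df = pd.json_normalize(records)
df.to_csv('my_file.csv', index=False)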
I am attempting to read MapInfo .dat files into .csv files using Python. So far, I have found the easiest way to do this is though xlwings and pandas.
When I do this (below code) I get a mostly correct .csv file. The only issue is that some columns are appearing as symbols/gibberish instead of their real values. I know this because I also have the correct data on hand, exported from MapInfo.
import xlwings as xw
import pandas as pd
app = xw.App(visible=False)
tracker = app.books.open('./cable.dat')
last_row = xw.Range('A1').current_region.last_cell.row
data = xw.Range("A1:AE" + str(last_row))
test_dataframe = data.options(pd.DataFrame, header=True).value
test_dataframe.columns = list(schema)
test_dataframe.to_csv('./output.csv')
When I compare to the real data, I can see that the symbols do actually map to the correct numbers (meaning that 1 = Â?, 2 = #, 3 = #, etc.).
Below is the first part of the 'dictionary' as to how they map:
My question is this:
Is there an encoding I can use to turn these series of symbols into their correct representations? The floats aren't the only columns affected by this, but they are the most important to my data.
Any help is appreciated.
import pandas as pd
from simpledbf import Dbf5
dbf = Dbf5('path/filename.dat')
df = dbf.to_dataframe()
MapInfo .dat files are dBase files underneath (https://www.loc.gov/preservation/digital/formats/fdd/fdd000324.shtml), so just use that method.
Then just output the data:
df.to_csv('outpath/filename.csv')
EDIT
If I understand correctly, you are using xlwings to load the .dat file into Excel, and then reading it into a pandas dataframe to export it to a csv file.
Somewhere along the way it indeed seems that some binary data is not interpreted, or is interpreted incorrectly, and is then written as text to your csv file.
Directly read the dBase file
My first suggestion would be to try to read the input file directly into Python without the use of an Excel instance.
According to Wikipedia, MapInfo .dat files are actually dBase III files. You can parse these in Python using a library like dbfread.
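A minimal sketch with dbfread (the encoding is a guess; MapInfo tables are often cp1252 or cp850, so you may need to experiment):

from dbfread import DBF
import pandas as pd

# Read the dBase III table directly; adjust 'encoding' to match the source data.
table = DBF('cable.dat', encoding='cp1252', char_decode_errors='replace')

# Each record is a dict-like object, so a DataFrame can be built straight from the iterator.
df = pd.DataFrame(iter(table))
df.to_csv('output.csv', index=False)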
Inspect the data before writing to csv
Secondly, I would inspect the 'corrupted' columns in Python instead of immediately writing them to disk.
Either something is going wrong in the Excel import and the data in these columns gets imported as text instead of some binary number format,
or this data is read correctly into memory as a byte array (instead of a float), and when you write it to csv it just gets dumped byte-wise to disk instead of being interpreted as a number and converted to a text representation.
Note
Small remark about your initial question regarding mapping text to numbers:
It will probably not be possible to create a straightforward map of characters to numbers:
These numbers could have any encoding and might not be stored as decimal text values like you now seem to assume.
These text representations are just a decoding using some character encoding (UTF-8, UTF-16); e.g. in UTF-8, several bytes might map to one character, and the question marks or squares you see might indicate that one or more characters could not be decoded.
In any case you will be losing information if you start from the text; you must start from the binary data to decode it.
I've read countless threads on here but I'm still unable to figure out exactly how to do this. I'm using the csv module in Python to write data to a csv file. My difficulty is that I've stored the header fields in a list (called header), and it contains a variable number of columns. I need to reference each column name so I can write it to my file, which would be easy, except that with a variable number of columns I can't figure out how to have a variable number of lists to write to. (Of course I'm using zip(*header, list1, list2, list3, ...) to write to the csv file, but how do I generate the i-th list so that header[i] populates it?) I'm sorry for the lack of code, I just can't figure out how to even begin. See the sketch below.
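One way to sidestep the "variable number of list variables" problem is to keep the columns in a single list of lists that runs parallel to header, then transpose with zip when writing. A rough sketch (the names and sample data are made up):

import csv

# 'header' and 'columns' are parallel: columns[i] holds the values for header[i].
header = ['col_a', 'col_b', 'col_c']                     # variable length
columns = [[1, 2, 3], ['x', 'y', 'z'], [7.5, 8.5, 9.5]]  # one list per column

with open('out.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)            # header row
    writer.writerows(zip(*columns))    # transpose columns into rows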