I'm coming to Python from a SAS background.
I've imported a SAS version 5 transport file (XPT) into Python using:
df = pd.read_sas(r'C:\mypath\myxpt.xpt')
The file is a simple SAS transport file, converted from a SAS dataset created with the following:
DATA myxpt;
  DO i = 1 TO 10;
    y = "XXX";
    OUTPUT;
  END;
RUN;
The file imports correctly and I can view the contents using:
print(df)
[Screenshot showing the print of the dataframe]
However, when I view the file using the variable explorer, all character columns are shown as blank.
[Screenshot showing the dataframe viewed through the Variable Explorer]
I've tried reading this as a SAS dataset instead of a transport file and importing it into Python, but I have the same problem.
I've also tried creating a dataframe within python containing character columns and this displays correctly within the variable explorer.
Any suggestions what's going wrong?
Thanks in advance.
Column Y is a column of byte strings. You have to decode it first: the Variable Explorer cannot guess the correct encoding and apparently does not show byte strings. If you do not know the encoding, you will have to guess. Try df['utf8'] = df.Y.str.decode('utf8') and see if the result makes any sense.
As you have noted, it is possible to specify the encoding in the import function:
df = pd.read_sas(r'C:\mypath\myxpt.xpt', encoding='utf8')
As a side note, you should always be aware of, and preferably explicit about, the encodings in use, to avoid major headaches.
For a list of all available encodings and their aliases, check here.
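To make the decoding concrete, here is a minimal sketch; the bytes column is fabricated to stand in for what read_sas returns when no encoding is given:

```python
import pandas as pd

# A stand-in for the imported XPT data: a column of byte strings,
# which some viewers (e.g. Spyder's Variable Explorer) display as blank
df = pd.DataFrame({"Y": [b"XXX"] * 3})

# Decode the byte strings into regular text
df["Y"] = df["Y"].str.decode("utf8")
print(df["Y"].iloc[0])  # XXX
```

Alternatively, pass encoding='utf8' to read_sas up front and the columns arrive already decoded.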
Related
I am facing a problem reading an Excel file with pandas which I have not been able to solve yet. The file has several sheets, so I have written a function for each of them; my aim is to automate the splitting and saving of each of the sheets I am interested in.
The problem is that with encoding UTF-8 I am not able to read the special characters we have in Spanish, such as: á, é, í, ó, ú, or ñ. I had a similar issue reading a CSV, which I solved using encoding Latin-1 and the python engine. However, it seems I cannot use either that encoding or that engine here, since this is an Excel file (I might be wrong, but I haven't been able to run the chunk because it throws an error).
Here is the code (I'm just showing one of the functions, since they are pretty much the same; the only thing that changes is the directories):
# I am joining the path and the name of the Excel file I have previously written
path_final = os.path.join(path, excel)
# Reading the excel choosing the sheet I want:
df_employees = pd.read_excel(path_final, sheetname='General (employees&interns)', encoding='UTF-8')
# Path to the directory I want to save the CSV in.
employees_path = 'General/03 HR Analytics/01. Dashboard/'+ year +'/'+ month +'/Processed/Employees/'
# We save it, with the name and format
df_employees.to_csv(employees_path +"HR_Data_Template_Data_" + business_list[position]+"_Employees.csv" , sep=';', encoding='UTF-8')
And this is what the Excel data looks like before and after it has gone through the functions. These names have been made up for the example:
# Before (original):
Carreño Julian
Natalia García
Gutiérrez Gutiérrez Pedro
# After:
Carreño Julian
Natlaia GarcÃa
Gutiérrez Gutiérrez Pedro
Do you know what the cause might be, and why UTF-8 is not working when reading Excel files? Do you know an alternative way of doing it?
If you have encountered the same problem, please give me a hint!
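For what it's worth, the "After" names look like classic mojibake: UTF-8 bytes decoded as Latin-1. A quick way to confirm the pattern, using one of the made-up names:

```python
# 'ñ' is U+00F1; its UTF-8 encoding is the two bytes 0xC3 0xB1.
# Decoding those bytes as Latin-1 yields the two characters 'Ã±',
# which matches the corruption seen in the "After" listing.
garbled = "Carreño".encode("utf-8").decode("latin-1")
print(garbled)  # CarreÃ±o
```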
Set an environment variable "GAME_FILENAME" using the following code in Qt, where args[0] contains the file name alättet.DTA:
qputenv("GAME_FILENAME", qUtf8Printable(args[0])); //args[0]='\temp\alättet.DTA'
When the environment variable is retrieved in Python with the following code:
files = os.environ.get('GAME_FILENAME')
fBytes = bytes(os.environ.get('GAME_FILENAME'))
One character 'ä' became two. Looking into fBytes, there are 4 bytes corresponding to 'ä'.
Verified on the Qt side by storing the result of qgetenv('GAME_FILENAME') in a QByteArray; the bytes look correct. Also, in Python, I tried os.environ['GAME_FILENAME'] = "alättet.DTA", and os.environ.get('GAME_FILENAME') then returns "alättet.DTA" correctly.
Anything missed here? Thanks.
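If the bytes really are UTF-8 mis-decoded as a one-byte codepage such as Latin-1 (an assumption, not confirmed by the question), the damage is reversible in Python; a sketch under that assumption:

```python
# Simulate the symptom: the UTF-8 bytes for 'ä' (0xC3 0xA4), decoded as
# Latin-1, appear as the two characters 'Ã¤'.
garbled = "alättet.DTA".encode("utf-8").decode("latin-1")
print(garbled)  # alÃ¤ttet.DTA

# Reverse the mis-decode to recover the original file name
fixed = garbled.encode("latin-1").decode("utf-8")
print(fixed)    # alättet.DTA
```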
I am trying to open an Excel file from Python, get it to recalculate, and then save it with the newly calculated values.
The spreadsheet is large and opens fine in LibreOffice with GUI, and initially shows old values. If I then do a Data->Calculate->Recalculate Hard I see the correct values, and I can of course saveas and all seems fine.
But there are multiple large spreadsheets I want to do this for, so I don't want to use a GUI; instead I want to use Python. The following all seems to work to create a new spreadsheet, but it doesn't have the new values (unless I again manually do a Recalculate Hard).
I'm running on Linux. First I do this:
soffice --headless --nologo --nofirststartwizard --accept="socket,host=0.0.0.0,port=8100,tcpNoDelay=1;urp"
Then, here is sample python code:
import uno
local = uno.getComponentContext()
resolver = local.ServiceManager.createInstanceWithContext("com.sun.star.bridge.UnoUrlResolver", local)
context = resolver.resolve("uno:socket,host=localhost,port=8100;urp;StarOffice.ServiceManager")
remoteContext = context.getPropertyValue("DefaultContext")
desktop = context.createInstanceWithContext("com.sun.star.frame.Desktop", remoteContext)
document = desktop.getCurrentComponent()
file_url="file://foo.xlsx"
document = desktop.loadComponentFromURL(file_url, "_blank", 0, ())
controller=document.getCurrentController()
sheet=document.getSheets().getByIndex(0)
controller.setActiveSheet(sheet)
document.calculateAll()
file__out_url="file://foo_out.xlsx"
from com.sun.star.beans import PropertyValue
pv_filtername = PropertyValue()
pv_filtername.Name = "FilterName"
pv_filtername.Value = "Calc MS Excel 2007 XML"
document.storeAsURL(file__out_url, (pv_filtername,))
document.dispose()
After running the above code and opening foo_out.xlsx, it shows the "old" values, not the recalculated ones. I know that calculateAll() is taking a little while, as I would expect for the recalculation. But the new values don't seem to actually get saved.
If I open it in Excel it does an auto-recalculate and shows the correct values and if I open in LibreOffice and do Recalculate Hard it shows the correct values. But, what I need is to save it, from python like above, so that it already contains the recalculated values.
Is there any way to do that?
Essentially, what I want to do from python is:
open, recalculate hard, saveas
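One detail worth double-checking in the snippet above: loadComponentFromURL and storeAsURL expect absolute file URLs, and file://foo.xlsx is not one (the URLs may simply have been simplified for the post). A small helper for building them, pure Python with no UNO required:

```python
from pathlib import Path

def to_file_url(path: str) -> str:
    # Build the absolute file:/// URL that UNO expects;
    # resolve() makes the path absolute so as_uri() is valid.
    return Path(path).resolve().as_uri()

file_url = to_file_url("foo.xlsx")
print(file_url)  # e.g. file:///home/user/foo.xlsx
```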
It seems that this was a problem with an older version of LibreOffice. I was using 5.0.6.2 on Linux, and even though I was recalculating, the new values were not showing up even when I extracted the cell values directly.
However, I upgraded to 6.2 and the problem has gone away, using the same code and the same input files.
I decided to just answer my own question instead of deleting it, as this was a source of frustration until I solved it.
I'm trying to read in a file. The text file itself is laid out in 9 columns with tons of data (454 lines total). I'm trying to read in and retrieve certain columns of data so I can plot a diagram of mass against temperature (an HR diagram).
However, when I try to load the text using:
file = 'nameoftext.txt'  # the file itself is saved as a .txt from Notepad++
track1 = np.loadtxt(file, skiprows=70)  # I'm skipping 70 rows of headers to reach the data (np is numpy)
I get an error saying:
ValueError: could not convert string to float: 'iso'
I have no idea what this means or what I'm doing.
I'm also using np.loadtxt because that's the only way my professor showed us how to load files, and I have no idea how else to do it.
Another option for loading .txt files in Python is the genfromtxt() function, also in numpy. With this function the type of the values in each column can be specified, or you can let the function guess the type on its own.
Check out the link below for a similar question.
Loading text file containing both float and string using numpy.loadtxt
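As an illustrative sketch (the column layout here is invented, not taken from the asker's file), genfromtxt with dtype=None handles a file that mixes text and numbers, which is exactly where loadtxt fails with "could not convert string to float":

```python
import io
import numpy as np

# A tiny stand-in for a whitespace-delimited file mixing text and numbers
data = io.StringIO("iso 1.0 2.0\niso 3.0 4.0\n")

# dtype=None lets genfromtxt infer each column's type; encoding='utf-8'
# gives real strings instead of byte strings
arr = np.genfromtxt(data, dtype=None, encoding="utf-8")
print(arr["f0"][0], arr["f1"][1])  # iso 3.0
```

The result is a structured array, so individual columns can be pulled out by field name for plotting.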
I've been working on some dataframes with Python. I load them in using pd.read_csv(filename, index_col=0) and it's all fine. The files also open fine in Excel. I also opened them in Notepad, and they seem alright; below is an example line:
851,1.218108787,0.636454978,0.269719611,-0.849476404,-0.143909689,0.050626813,-0.094248374,-0.3096134,-0.131347142,0.671271112,0.167593329,0.439417259,-0.198164647,-0.031552824,-0.215189948,-0.1791156,0.092648696,-0.107840318,-0.162596466,0.019324121,0.040572892,-0.008307331,-0.077819297,-0.023809355,-0.148229913,-0.041082835,0.138234498,-0.070986117,0.024788437,-0.050982962,0.24689969,0
The first column is as I understand it an index column. Then there's a bunch of Principal Components, and at the end is a 1/0.
When I try and load the file into WEKA, however, it gives me a nasty error and urges me to use the converter, saying:
Reason:
32 Problem encountered on line: 2
When I attempt to use the converter with the default settings, it states a new error:
Couldn't read object file_name.csv invalid stream header: 2C636F6D
Could anyone help with any of this? I can't provide the entire data file, but if requested I can try to cut out a few rows and paste only those, if the error still occurs. Are there any flags I need to specify when saving a file to CSV in Python? At the moment I just use .to_csv('x.csv').
I think the index column not having a header is what prevents WEKA from reading it. When you write the file using pandas.to_csv(), set index=False:
df.to_csv('x.csv', index=False)
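A minimal sketch of the effect (column names invented for illustration):

```python
import io
import pandas as pd

df = pd.DataFrame({"pc1": [1.218, -0.849], "label": [0, 1]})

# index=False stops pandas from writing the unnamed index column,
# so the header row and the data rows have the same number of fields
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue().splitlines()[0])  # pc1,label
```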