Exporting a pandas DataFrame of complex numbers to Excel - Python

I am working with a pandas DataFrame which has complex numbers as column data. I am trying to export this DataFrame to Excel using the DataFrame.to_excel method, which throws the following error.
raise ValueError("Cannot convert {0} to Excel".format(value))
ValueError: Cannot convert (1.044574-3496.069365j) to Excel
Is there any workaround for this? My DataFrame looks like this,
Freq lne_10720_15820_1-lne_10720_18229_1 lne_10720_15820_1 \
48 (1.044574-3496.069365j) (7.576632+64.778558j)
50 (1.049333-3355.448147j) (7.557604+67.544162j)
52 (1.054253-3225.613165j) (7.656567+70.317672j)

A simple solution to your problem is to cast the variables in the dataframe to string type, as follows:
df = df.astype(str)
After performing this operation, every column has object dtype, which you can verify using df.dtypes.
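For completeness, a minimal sketch of the full round trip (the column names and output path here are illustrative, not from the question):
import pandas as pd

# illustrative data; any complex-valued column behaves the same way
df = pd.DataFrame({'Freq': [48, 50],
                   'Z': [1.044574 - 3496.069365j, 1.049333 - 3355.448147j]})

# cast everything to str so to_excel can serialize the values,
# e.g. '(1.044574-3496.069365j)'
df.astype(str).to_excel('output.xlsx', index=False)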
Hope this helps :)

Related

Writing a pandas DataFrame to CSV without the scientific format

I have a mock-up dataframe below that resembles my original dataframe very closely:
sof = pd.DataFrame({'id':['1580326032400705442181105','15803260000063243713608360','1580326343500677412104013','15803260343000000705432103406'],'class':['a','c','c','d']})
When I write this dataframe to my desktop using the to_csv function, I see the ids automatically being converted to scientific format (example: 1.5803260324007E+24).
I have a few questions on this:
Why does Python convert this column (obviously of type 'object') to a numeric format?
How do I preserve my format?
I have tried the following
sof.to_csv('path',float_format='%f',index = False)
Doesn't seem to change anything.
sof['id'].astype(int).astype(str)
Trying to convert the supposed "float" to int and then to string
It gives the following error: OverflowError: Python int too large to convert to C long
Can i get some guidance on how this can be achieved?
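One observation that may help, sketched below under the assumption that 'id' stays an object (string) column: to_csv writes the digits verbatim, and the scientific display usually appears only after a float round-trip, for example when the CSV is reopened in Excel or re-read without dtype=str ('path.csv' is an illustrative path):
import pandas as pd

sof = pd.DataFrame({'id': ['1580326032400705442181105',
                           '15803260000063243713608360'],
                    'class': ['a', 'c']})

# the object-dtype ids are written out digit-for-digit
sof.to_csv('path.csv', index=False)

# re-reading with dtype=str avoids the float conversion on the way back in
back = pd.read_csv('path.csv', dtype={'id': str})
print(back['id'].tolist())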

Convert timestamp to datetime for a Vaex dataframe

I have a parquet file that I have loaded as a Vaex dataframe. The parquet file has a column for a timestamp in the format 2022-10-12 17:10:00+00:00.
When I try to do any kind of analysis with my dataframe I get the following error.
KeyError: "Unknown variables or column: 'isna(timestamp)'"
When I remove that column everything works. I assume that the time column is not in the correct format. But I have been having trouble converting it.
I tried
df['timestamp'] = pd.to_datetime(df['timestamp'].astype(str))
but I get the error <class 'vaex.expression.Expression'> is not convertible to datetime, so I assume I can't mix pandas and vaex.
I am also having trouble producing a minimal reproducible example, since when I create a dataframe myself the datetime column is a string and everything works fine.
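One possible workaround, sketched here under the assumption that the data fits in memory ('data.parquet' is an illustrative path): round-trip the column through pandas to coerce it to a proper datetime64, then rebuild the vaex dataframe.
import pandas as pd
import vaex

df = vaex.open('data.parquet')
pdf = df.to_pandas_df()

# parse the '2022-10-12 17:10:00+00:00' strings, then drop the UTC offset
# so the column becomes a plain datetime64[ns]
pdf['timestamp'] = pd.to_datetime(pdf['timestamp'], utc=True).dt.tz_localize(None)

df = vaex.from_pandas(pdf)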

Passing PySpark pandas_udf data limit?

The problem is simple. Please observe the code below.
@pyf.pandas_udf(pyt.StructType(RESULTS_SCHEMA_LIST), pyf.PandasUDFType.GROUPED_MAP)
def train_udf(df):
    return train_ml_model(df=df)

results_df = complete_df.groupby('training-zone').apply(train_udf)
One of the columns of the results_df is typically a very large string (>4e6 characters). This isn't a problem for a pandas.DataFrame, or for a spark.DataFrame when I convert the pandas dataframe to a spark dataframe, but it is a problem when the pandas_udf() attempts the conversion. The error returned is: pyarrow.lib.ArrowInvalid: could not convert **string** with type pyarrow.lib.StringValue: did not recognize the Python value type when inferring an Arrow data type
This UDF does work if I don't return the problematic column or I make the problematic column only contain some small string such as 'wow that is cool', so I know the problem is not with the udf itself per se.
I know the function train_ml_model() works because when I get a random group from the spark dataframe then convert it to a pandas dataframe and pass it to train_ml_model() it produces the expected pandas dataframe with the column with a massive string.
I know spark can handle such large strings because when I convert the pandas dataframe to a spark dataframe using spark.createDataFrame() the spark dataframe contains the full expected value.
PS: Why is pyarrow even trying to infer the datatype when I pass the types to the pandas_udf()?
Any help would be very much appreciated!
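For reference, a minimal self-contained sketch of the grouped-map pattern from the question; the schema, the Spark session setup, and the stand-in for train_ml_model are all illustrative assumptions:
import pandas as pd
import pyspark.sql.functions as pyf
import pyspark.sql.types as pyt
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# illustrative schema: one key column plus one large-string column
schema = pyt.StructType([
    pyt.StructField('training-zone', pyt.StringType()),
    pyt.StructField('model_blob', pyt.StringType()),
])

@pyf.pandas_udf(schema, pyf.PandasUDFType.GROUPED_MAP)
def train_udf(df):
    # stand-in for train_ml_model: one row per group carrying a large string
    return pd.DataFrame({'training-zone': [df['training-zone'].iloc[0]],
                         'model_blob': ['x' * 10_000]})

complete_df = spark.createDataFrame(
    [('a', 1.0), ('a', 2.0), ('b', 3.0)], ['training-zone', 'value'])
results_df = complete_df.groupby('training-zone').apply(train_udf)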

What's the easiest way to replace categorical columns of data with codes in Pandas?

I have a table of data in .dta format which I have read into Python using pandas. The data is mostly of the categorical data type, and I want to replace the columns with numerical data that can be used for machine learning, such as boolean (1/0) values or codes. The trouble is that I can't directly replace the data, because pandas won't let me change the categories unless I add them first.
I have tried using pd.get_dummies(), but it keeps returning an error:
TypeError: 'columns' is an invalid keyword argument for this function
print(pd.get_dummies(feature).head(), columns=['smkevr', 'cignow', 'dnnow',
'dnever', 'complst'])
Is there a simple way to replace this data with numerical codes based on the value (for example 'Not applicable' = 0)?
I do it the following way:
df_dumm = pd.get_dummies(feature).head()
df_dumm.columns = ['smkevr', 'cignow', 'dnnow',
'dnever', 'complst']
print(df_dumm.head())
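Since the goal is value-based codes rather than dummy columns, another option worth sketching is the .cat.codes accessor on pandas Categorical columns (the data below is illustrative, reusing one column name from the question):
import pandas as pd

# illustrative categorical column like those in the question
df = pd.DataFrame({'smkevr': pd.Categorical(['Yes', 'No', 'Not applicable'])})

# .cat.codes maps each category to an integer code (missing values become -1)
df['smkevr_code'] = df['smkevr'].cat.codes
print(df)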

Convert numbers to strings when reading an excel spreadsheet into a pandas DataFrame

I'm reading some excel spreadsheets (xlsx format) into pandas using read_excel, which generally works great. The problem I have is that when a column contains numbers, pandas converts these to float64 type, and I would like them to be treated as strings. After reading them in, I can convert the column to str:
my_frame.my_col = my_frame.my_col.astype('str')
This works as far as assigning the right type to the column, but when I view the values in this column, the strings are formatted in scientific notation, e.g. 8.027770e+14, which is not what I want. I'd like to work out how to tell pandas to read columns as strings, or to do the conversion later, so that I get values in their original (non-scientific) format.
pandas.read_csv() has a dtype argument:
dtype : Type name or dict of column -> type
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32}
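In recent pandas versions read_excel accepts the same dtype mapping, so the column can be read as strings up front; a minimal sketch ('book.xlsx' and 'my_col' are illustrative names):
import pandas as pd

# dtype=str keeps the original digits instead of a float64 round-trip
my_frame = pd.read_excel('book.xlsx', dtype={'my_col': str})
print(my_frame['my_col'].head())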
I solve it with round: if you do round(number, 5), in most cases you will not lose data, and you will get zero in the case of 8.027770e+14.
