I'm trying to plot data read into Pandas from an xlsx file. After some minor formatting and data quality checks, I try to plot using matplotlib but get the following error:
TypeError: Empty 'DataFrame': no numeric data to plot
This is not a new issue and I have followed many of the pages on this site dealing with this very problem. The posted suggestions, unfortunately, have not worked for me.
My data set includes strings (locations of sampling sites and limited to the first column), dates (which I have converted to the correct format using pd.to_datetime), many NaN entries (that cannot be converted to zeros due to the graphical analysis we are doing), and column headings representing various analytical parameters.
As per some of the suggestions I read on this site, I have tried the following:
df = df.astype(float), which gives me ValueError: could not convert string to float: 'Site 1' ('Site 1' is a sampling location).
df = df.apply(pd.to_numeric, errors='ignore'), which gives me dtypes: float64(13), int64(1), object(65) and therefore does not appear to work, as most of the data remains as object. The date entries are the int64, and I cannot figure out why some of the data columns are float64 while others remain objects.
df = df.apply(pd.to_numeric, errors='coerce'), which wipes out the entire DataFrame, presumably because this operation turns every non-numeric entry into NaN?
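For illustration, a minimal sketch (with made-up values) of what errors='coerce' does to a column that mixes text and numbers:

import pandas as pd

s = pd.Series(['Site 1', '01/01/2019', '0.5'])
print(pd.to_numeric(s, errors='coerce'))
# 0    NaN
# 1    NaN
# 2    0.5
# dtype: float64

Applied to the whole frame, the string columns (site names, any unparsed dates) become all-NaN, which would explain why the plot then reports no numeric data.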
I'm stuck and would appreciate any insight.
EDIT
I was able to solve my own question based on some of the feedback. Here is what worked for me:
df = "path"
header = [0] # keep column headings as first row of original data
skip = [1] # skip second row, which has units of measure
na_val = ['.','-.','-+0.01'] # Convert spurious decimal points that have
# no number associated with them to NaN
convert = {col: float for col in (4,...,80)} # Convert specific rows to
# float from original text
parse_col = ("A","C","E:CC") # apply to specific columns
df = pd.read_excel(df, header = header, skiprows = skip,
na_values = na_val, converters = convert, parse_columns = parse_col)
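To verify the conversion before plotting, a quick dtype check helps (this snippet is my addition):

print(df.dtypes.value_counts())        # most columns should now be float64
numeric = df.select_dtypes(include='number')
numeric.plot()                         # the original TypeError should be gone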
Hard to answer without a data sample, but if you are sure that the numeric columns are 100% numeric, this will probably work:
for c in df.columns:
    try:
        df[c] = df[c].astype(int)
    except (ValueError, TypeError):
        pass  # leave non-numeric columns (e.g. site names) unchanged
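One caveat (my addition, given the many NaN entries mentioned in the question): a column containing NaN cannot be cast to int, so float is the safer target here:

for c in df.columns:
    try:
        df[c] = df[c].astype(float)  # float, not int: NaN cannot be stored in an int column
    except (ValueError, TypeError):
        pass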
Related
I'm using pandas to load a short_desc.csv with the following columns: ["report_id", "when","what"]
with
# read csv
shortDesc = pd.read_csv('short_desc.csv')
# get all numerical and non-null values
shortDesc = shortDesc[shortDesc['report_id'].str.isdigit().notnull()]
# convert 'when' from UNIX timestamp to datetime
shortDesc['when'] = pd.to_datetime(shortDesc['when'], unit='s')
which results in the dataframe shown in the picture.
I'm trying to remove rows that have duplicate 'report_id's by sorting by date and keeping, for each 'report_id', the row with the newest date:
shortDesc = shortDesc.sort_values(by='when').drop_duplicates(['report_id'], keep='last')
The problem is that when I use .sort_values() on this particular dataframe, the values of 'what' come out scattered across all columns and the 'report_id' values disappear:
shortDesc = shortDesc.sort_values(by=['when'], inplace=False)
I'm not sure why this is happening in this particular instance, since I was able to achieve the correct result on another dataframe with the same shape using the same code (P.S. it's not a mistake; I dropped the 'what' column in the second pic):
[screenshot: dataframe of similar shape]
[screenshot: desired result with the similar-shape dataframe]
I found out that:
# get all numerical and non-null values
shortDesc = shortDesc[shortDesc['report_id'].str.isdigit().notnull()]
was only checking whether the values were not null: .notnull() ran on the output of str.isdigit(), which is non-null for every non-null string whether or not it is all digits, so the mask never dropped non-numeric 'report_id' values. I changed this to two separate lines
shortDesc = shortDesc[shortDesc['report_id'].notnull()]
shortDesc = shortDesc[shortDesc['report_id'].str.isnumeric()]
which allowed
shortDesc.sort_values(by='when', inplace=True)
to work as intended. I am still confused as to why .sort_values(by="when") was affected by the column "report_id", so if anyone knows, please enlighten me.
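To illustrate why the original mask kept every non-null row, here is a minimal sketch with made-up values:

import pandas as pd

s = pd.Series(['123', 'abc', None])
print(s.str.isdigit())            # True, False, NaN
print(s.str.isdigit().notnull())  # True, True, False -- keeps 'abc'
print(s.str.isnumeric())          # True, False, NaN  -- hence notnull() must be applied first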
I have run across an inexplicable problem with Pandas regarding inserting a numpy array into a cell of a DataFrame. I am aware of the standard fixes for this error, but this case seems a little stranger.
I load a .csv file into a dataframe, collapse the last 1000 columns of each row into a numpy array, and place it in a cell of a new dataframe.
The header format of the original file is essentially:
["TIMESTAMP", "Header0", ... , "Header10", "Waveform (1)", ... , "Waveform (1001)"]
# Load the dataframe from the csv file, skipping some extra headers
input_df = pd.read_csv(filepath, header=1, skiprows=[2, 3], index_col='TIMESTAMP', parse_dates=True)

# Header columns that are not part of the waveform
header_columns = ["Header0", "Header1", "Header2", "Header3",
                  "Header4", "Header5", "Header6", "Header7",
                  "Header8", "Header9", "Header10"]

# Create a version of the input df that is only the Waveform columns, dropping the last one
waveform_df = input_df.drop(columns=header_columns).iloc[:, :-1]

# Create a version of the input df that is only the Header columns
header_df = input_df[header_columns]

# Create an output df that is a copy of the header df, with a column added for storing the compressed waveforms
output_df = header_df.copy()
output_df['Waveform'] = None

# For every row in the waveform df (which has the same length as output_df)
for index, row in waveform_df.iterrows():
    # create a numpy array from all the waveform columns of that row
    waveform_array = waveform_df.loc[index].to_numpy()
    # save the numpy array in the 'Waveform' column of the current row by index
    output_df.at[index, 'Waveform'] = waveform_array

# return the compressed df
return output_df
This code actually works perfectly fine most of the time. The file I originally tested it on worked fine, and had 30 or so rows in it. However, when I attempted to run the same code on a different file with the same header formats which was about 1200 rows, it gave the error:
ValueError: Must have equal len keys and value when setting with an iterable
for the line:
output_df.at[index, 'Waveform'] = waveform_array
I compared the two incessantly and discovered that the only real difference between the two is the length, and in fact, if I crop the longer dataframe to a similar length to the shorter one, the error disappears. (Specifically, it works so long as I crop the dataframe to less than 250 rows)
I'd really like to be able to do this without cropping the dataframe and rebuilding it, so I was wondering if anyone had an insight into why this error was occurring.
Thanks!
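One possibility worth ruling out (an assumption on my part, since the failing file isn't shown): df.at[index, col] assigns to every row sharing that index label, so a duplicated TIMESTAMP in the longer file would make pandas expect an iterable with one value per matching row, raising exactly this error. A quick diagnostic:

# If the index has duplicates, .at[label, col] targets several rows at once,
# and assigning a single array then fails with the "equal len keys" error.
print(input_df.index.is_unique)

# List any duplicated timestamps (purely diagnostic):
print(input_df.index[input_df.index.duplicated()])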
I'm having a problem converting the data in my dataframe to percentage format while keeping it as a float.
I prepared a simple example that reflects the code from my actual project:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 15, size=(10, 4)), columns=list('ABCD'))
print(df)

cols = df.columns
for col in cols:
    df[col] = df[col].astype(float).map(lambda n: '{:.4%}'.format(n))

print(df)
print(df.dtypes)
In my actual project I need to choose ONLY the columns that contain a specific string and do some calculations on their values. At the end I have to change the formatting to a percentage with 4 decimal places. Even though I use astype(float), my values are still str.
Consequently, when I save the dataframe to an excel file, the values are pasted as text and not as numbers.
In addition, while creating a line chart from this dataframe, I get the error 'unhashable type: 'numpy.ndarray''.
Please advise on how to successfully convert the data to percentage format while keeping it as a float, in order to get an accurate paste into the excel file and create a line chart with matplotlib.
Thanks a lot!
I believe this is because you are using .format after .astype(float). The code first converts the values to float, but as format is a str function, each value then gets converted to a string with 4 decimal places and a '%' sign. Note that simply reversing the two steps (in one line or two) does not help either: calling .astype(float) on a formatted value like '1200.0000%' raises a ValueError, because float() cannot parse the percent sign. A cell cannot be a float and carry a '%' at the same time, so keep the underlying data numeric and apply the percentage formatting only when displaying or exporting.
Hope that works!
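A sketch of that idea, keeping the floats intact (the output file name is made up, and the export loop assumes the openpyxl engine is installed):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 15, size=(10, 4)), columns=list('ABCD'))
df = df.astype(float)

# For display (e.g. in a notebook), attach the format without touching the data:
styled = df.style.format('{:.4%}')

# For Excel, write the floats and let a number format render the percent sign;
# '0.0000%' multiplies by 100 for display, just like Python's '{:.4%}'.
with pd.ExcelWriter('output.xlsx', engine='openpyxl') as writer:  # hypothetical file name
    df.to_excel(writer, sheet_name='Sheet1')
    worksheet = writer.sheets['Sheet1']
    for row in worksheet.iter_rows(min_row=2, min_col=2):
        for cell in row:
            cell.number_format = '0.0000%'

# The dtypes stay float64, so df.plot() works for the matplotlib line chart:
print(df.dtypes)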
For a dataframe which looks like the one in the picture, I want to simply set the index to be the Date column, which you see as the first column.
The dataframe comes from an API; I save the data into a csv and read it back:
data.to_csv('stocks.csv', header=True, sep=',', mode='a')
data = pd.read_csv('stocks.csv', header=[0, 1, 2])
data
Preferably I would also like to get rid of the "Unnamed:.." labels you see in the picture.
Thanks.
I solved it by specifying header=[0,1], index_col=0 in the read_csv call, and afterwards converting the dataframe to numeric, since the dtypes got distorted (though I believe that is not always necessary):
data = pd.read_csv('stocks.csv', header=[0, 1], index_col=0)
data = data.apply(pd.to_numeric, errors='coerce')
# optionally:
data = data.dropna()
In this fashion I get exactly what I want, namely to write e.g.
data['AGN.AS']['High']
and get the high values for a specific stock.
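For anyone wanting to reproduce this without the original API data, here is a minimal round-trip sketch (the second ticker is hypothetical):

import numpy as np
import pandas as pd

# Two-level columns mimicking the stocks.csv layout
cols = pd.MultiIndex.from_product([['AGN.AS', 'ASML.AS'], ['High', 'Low']])
idx = pd.date_range('2021-01-01', periods=3, name='Date')
data = pd.DataFrame(np.random.rand(3, 4), index=idx, columns=cols)

data.to_csv('stocks.csv')
back = pd.read_csv('stocks.csv', header=[0, 1], index_col=0)
print(back['AGN.AS']['High'])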
Trying to answer the question Get List of Unique String per Column, we ran into a different problem with my dataset: when I import this CSV file into a dataframe, every column is object dtype. We need to convert the columns that contain only numbers to a numeric dtype, and keep those that are not numbers as strings.
Is there a way to achieve this?
Download the data sample from here
I have tried the following code from the article Pandas: change data type of columns, but it did not work:
df = pd.DataFrame(a, columns=['col1','col2','col3'])
As always, thanks for your help.
Option 1
Use pd.to_numeric in an apply:
df.apply(pd.to_numeric, errors='ignore')
Option 2
Use pd.to_numeric on df.values.ravel():
import numpy as np

cvrtd = pd.to_numeric(df.values.ravel(), errors='coerce').reshape(-1, len(df.columns))
pd.DataFrame(np.where(np.isnan(cvrtd), df.values, cvrtd), df.index, df.columns)
Note
These are not exactly the same. For a column that contains mixed values, option 2 converts what it can, while option 1 leaves everything in that column as object. Looking at your file, I'd choose option 1.
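A small example of the difference, with made-up values:

import numpy as np
import pandas as pd

df = pd.DataFrame({'mixed': ['1', 'x', '3'], 'clean': ['4', '5', '6']})

opt1 = df.apply(pd.to_numeric, errors='ignore')
print(opt1.dtypes)  # mixed: object (untouched), clean: int64

cvrtd = pd.to_numeric(df.values.ravel(), errors='coerce').reshape(-1, len(df.columns))
opt2 = pd.DataFrame(np.where(np.isnan(cvrtd), df.values, cvrtd), df.index, df.columns)
print(opt2)  # in 'mixed', '1' and '3' become 1.0 and 3.0; 'x' is kept as-is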
Timing
df = pd.read_csv('HistorianDataSample/HistorianDataSample.csv', skiprows=[1, 2])