This is the code I'm using. I have also tried converting the datatype of my columns from object to float, but I got this error:
df = pd.read_csv('DDOSping.csv')
pearsoncorr = df.corr(method='pearson')
ValueError: could not convert string to float:
'172.27.224.251-172.27.224.250-56003-502-6'
Somewhere in your CSV this string value exists '172.27.224.251-172.27.224.250-56003-502-6'. Do you know why it's there? What does it represent? It looks to me like it shouldn't be in the data you include in your correlation matrix calculation.
The df.corr method tries to convert every value to a float, which is obviously not possible here because this is a long composite string with various characters, not a regular number.
You should clean your CSV of unnecessary data (or make a copy and clean that so you don't lose anything important). Remove anything, like metadata, that isn't the exact data that df.corr needs, including the string in the error message.
If it's just a few values you need to clean, then just open the file in Excel or a text editor and do the cleaning there. If it's a lot, and all the irrelevant data to be removed is in specific rows and/or columns, you could instead drop them from your DataFrame before calling df.corr rather than cleaning the file itself.
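For example, here is a minimal sketch of the second approach - keeping only the numeric columns before computing the correlation matrix (the 'Flow ID' name in the commented-out line is just a guess at what your string column might be called):
import pandas as pd

df = pd.read_csv('DDOSping.csv')

# keep only numeric columns; string columns such as a flow-identifier column are dropped
numeric_df = df.select_dtypes(include='number')

# or drop a known non-numeric column by name (hypothetical column name)
# numeric_df = df.drop(columns=['Flow ID'])

pearsoncorr = numeric_df.corr(method='pearson')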
I've been working on a DataFrame filtering project. I have a table in Excel, which I converted to a UTF-8 CSV file. In Excel I formatted all the columns as numbers with two digits after the decimal comma.
However, as you can see in the figure, some of my columns come out differently. The default should be xx.xx, but some columns appear as xx.xxxxx in the DataFrame. I can filter the xx.xx columns properly, but the other columns cause problems. I tried to filter them as xx.xxxxx, but that didn't work either. How can I get rid of this problem?
In PyCharm's View as DataFrame tool I can format the values, but that only affects the display. What should I do about this?
You can use round to choose the number of digits you want to keep. This way, you can have an equal number of decimals in all columns.
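For example, a minimal sketch with made-up column names and values:
import pandas as pd

df = pd.DataFrame({'price': [12.34567, 45.6], 'rate': [0.12345, 9.87]})

# round every numeric column to 2 decimals so all columns match the xx.xx format
df = df.round(2)

# filtering now behaves consistently across the rounded columns
filtered = df[df['price'] > 12.0]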
I am using Python 3.7
I need to load data from two different sources (both csv) and determine which rows from the one source are not in the second source.
I have used pandas data-frames to load the data and do a comparison between the two sources of data.
I loaded the data from the csv file and a value like 2010392 is turned to 2010392.0 in the data-frame column.
I have read quite a number of articles about formatting data-frame columns; unfortunately, most of them are about date and time conversions.
I came across an article "Format integer column of Data-frame in Python pandas" at http://www.datasciencemadesimple.com/format-integer-column-of-dataframe-in-python-pandas/ which does not solve my problem
Based on the above mentioned article I have tried the following:
pd.to_numeric(data02['IDDLECT'], downcast='integer')
Out[63]:
0 2010392.0
1 111777967.0
2 2010392.0
3 2012554.0
4 2010392.0
5 2010392.0
6 2010392.0
7 1170126.0
and as you can see, the column values still have a decimal point with a zero.
I expect the load of the dataframe from a csv file to keep the format of a number such as 2010392 to be 2010392 and not 2010392.0
Here is the code that I have tried:
import pandas as pd
data = pd.read_csv("timetable_all_2019-2_groups.csv")
data02 = data.drop_duplicates()
print(f'Len data {len(data)}')
print(data.head(20))
print(f'Len data02 {len(data02)}')
print(data02.head(20))
pd.to_numeric(data02['IDDLECT'], downcast='integer')
Here are a few lines of the content of the CSV file:
The data in the one source looks like this:
IDDCYR,IDDSUBJ,IDDOT,IDDGRPTYP,IDDCLASSGROUP,IDDLECT,IDDPRIMARY
019,AAACA1B,VF,C,A1,2010392,Y
2019,AAACA1B,VF,C,A1,111777967,N
2019,AAACA3B,VF,C,A1,2010392,Y
2019,AAACA3B,VF,C,A1,2012554,N
2019,AAACB2A,VF,C,B1,2010392,Y
2019,AAACB2A,VF,P,B2,2010392,Y
2019,AAACB2A,VF,C,B1,2010392,N
2019,AAACB2A,VF,P,B2,1170126,N
2019,AAACH1A,VF,C,A1,2010392,Y
Looks like you have data which is not of integer type. Once loaded you should do something about that data and then convert the column to int.
From your error description, you have nans and/or inf values. You could impute the missing values with the mode, mean, median or a constant value. You can achieve that either with pandas or with sklearn imputer, which is dedicated to imputing missing values.
Note that if you use mean, you may end up with a float number, so make sure to get the mean as an int.
The imputation method you choose really depends on what you'll use this data for later. If you want to understand the data, filling nans with 0 may destroy aggregation functions later (e.g. if you'll want to know what the mean is, it won't be accurate).
That being said, I see you're dealing with categorical data. One option here is to use dtype='category'. If you later fit a model with this and you leave the ids as numbers, the model can conclude weird things which are not correct (e.g. that the sum of two ids equals some third id, or that higher ids are more important than lower ones... things that a priori make no sense and should not be ignored and left to chance).
Hope this helps!
data02['IDDLECT'] = data02['IDDLECT'].fillna(0).astype('int')
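As a minimal sketch of the options discussed above (IDDLECT is the column from your sample data; using the median is just one possible choice):
import pandas as pd

data = pd.read_csv("timetable_all_2019-2_groups.csv")
data02 = data.drop_duplicates().copy()

# impute missing ids with the median and cast the column to a plain int
median_id = int(data02['IDDLECT'].median())
data02['IDDLECT'] = data02['IDDLECT'].fillna(median_id).astype(int)

# or, since these are really identifiers, treat them as categorical instead
data02['IDDLECT'] = data02['IDDLECT'].astype('category')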
read_csv contains a lot of parsing logic to detect and convert CSV strings to numerical and datetime Python values. My question is: is there a way to run the same conversions on a DataFrame which contains columns with string data, but where the DataFrame is not stored in a CSV file and instead comes from a different (unparsed) source? So only an in-memory DataFrame object is available.
Saving such a DataFrame to a CSV file and reading it back would perform the conversion, but that looks very inefficient to me.
If you have e.g. a column of string type which actually contains a date (e.g. yyyy-mm-dd), you can use pd.to_datetime() to convert it to Timestamp.
Assuming that the column name is SomeDate, you can call:
df.SomeDate = pd.to_datetime(df.SomeDate)
Another option is to apply your own conversion function to any of your columns (search the Web for a description of apply).
You didn't give any details, so I can give only such very general advice.
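For example, a minimal sketch of a per-column conversion attempt (the column names are made up, and trying numbers first and then dates is just one possible policy, not the actual read_csv parser):
import pandas as pd

df = pd.DataFrame({
    'SomeDate': ['2019-01-01', '2019-02-15'],
    'Amount': ['10.5', '20'],
    'Label': ['a', 'b'],
})

for col in df.columns:
    # try numbers first, then dates; leave the column unchanged if neither fits every value
    as_number = pd.to_numeric(df[col], errors='coerce')
    if as_number.notna().all():
        df[col] = as_number
        continue
    as_date = pd.to_datetime(df[col], errors='coerce')
    if as_date.notna().all():
        df[col] = as_date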
I recently started trying to use the HDF5 format in Python pandas to store data, but I've run into a problem I can't find a workaround for. Before, I worked with CSV files and had no trouble appending new data.
This is what I try:
store = pd.HDFStore('cdw.h5')
frame.to_hdf('cdw.h5','cdw/data_cleaned', format='table',append=True, data_columns=True,dropna=False)
And it throws:
ValueError: invalid combinate of [values_axes] on appending data [name->Ordereingangsdatum,cname->Ordereingangsdatum,dtype->float64,kind->float,shape->(1, 176345)] vs current table [name->Ordereingangsdatum,cname->Ordereingangsdatum,dtype->bytes128,kind->string,shape->None]
I get that it tells me I'm trying to append a different data type for a column, but what baffles me is that I have previously written the same CSV file, together with some other CSV files, from a DataFrame to that HDF5 file.
I'm doing analysis in the forwarding industry and the data there is very inconsistent - more often than not there are missing values or mixed dtypes in columns or other 'data dirt'.
I'm looking for a way to append data to the HDF5 file no matter what is inside the column, as long as the column names are the same.
It would be great to be able to append data to the HDF store independent of datatypes, or to find another simple solution for my problem. The goal is to automate the analysis later on, so I'd rather not change datatypes every time I have a missing value in one of the 62 columns I have.
Another question within my question:
Reading with read_hdf takes more time than read_csv; I have around 1.5 million rows with 62 columns. Is this because I have no SSD drive? I have read that file access with read_hdf should be faster.
I'm asking myself whether I should rather stick with CSV files or with HDF5.
Help would be greatly appreciated.
Okay, for anyone having the same issue with appending data where the dtype is not always guaranteed to be the same: I finally found a solution. First convert every column to object:
li = list(frame)
frame[li] = frame[li].astype(object)
You can check the result with frame.info(). Then try df.to_hdf(key, value, append=True) and wait for its error message. The error TypeError: Cannot serialize the column [not_one_datatype] because its data contents are [mixed] object dtype will tell you which columns it still doesn't like. Convert those columns with df['not_one_datatype'].astype(float); converting them to float worked for me. Only use integer if you are sure that a float will never occur in that column, otherwise the append method will break again.
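A minimal sketch of that workflow, with placeholder file and column names:
import pandas as pd

frame = pd.read_csv('cdw.csv')  # placeholder for however the DataFrame was built

# force every column to object first
li = list(frame)
frame[li] = frame[li].astype(object)

# any column that to_hdf still complains about gets an explicit numeric dtype
frame['not_one_datatype'] = frame['not_one_datatype'].astype(float)

frame.to_hdf('cdw.h5', 'cdw/data_cleaned', format='table', append=True,
             data_columns=True, dropna=False)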
I decided to work with CSV and HDF5 files in parallel. If I hit a problem with HDF5 that I have no workaround for, I will simply switch to CSV - this is what I can personally recommend.
Update: Okay, it seems that the creators of this format did not have real-world data in mind when designing the HDF API: the HDF5 min_itemsize error ValueError: Trying to store a string with len [##] in [y] column but this column has a limit of [##]! occurs when trying to append data to an already existing file if some column happens to contain longer strings than on the initial write to the HDF file.
Now the joke here is that the creators of this API expect me to know the maximum length of every possible value in a column at the first write? Really? Another inconsistency is that df.to_hdf(append=True) does not have the parameter min_itemsize={'column1': 1000}. This format is at best suited for storing self-created data, but definitely not for data where the dtypes and the lengths of the entries in each column are not set in stone. The only solution left, if you want to append data from pandas DataFrames independently of the stubborn HDF5 API in Python, is to insert into every DataFrame, before appending, a row with very long strings (except for the numeric columns), just to be sure that you will always be able to append the data no matter how long it might get.
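For what it's worth, a minimal sketch of reserving string width up front via HDFStore.append and min_itemsize (the Kommentar column and the 1000-character limit are made up), which of course only helps if you can guess the maximum length in advance:
import pandas as pd

frame = pd.DataFrame({'Ordereingangsdatum': ['2019-01-01'], 'Kommentar': ['kurz']})

with pd.HDFStore('cdw.h5') as store:
    # reserve room for strings up to 1000 characters on the very first write
    store.append('cdw/data_cleaned', frame, format='table',
                 data_columns=True, min_itemsize={'Kommentar': 1000})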
When doing this, the write process takes ages and eats up gigantic amounts of disk space for the huge HDF5 file.
CSV definitely wins against HDF5 in terms of performance, integration and especially usability.