How to run parsing logic of Pandas read_csv on custom data?

read_csv contains a lot of parsing logic to detect and convert CSV strings to numerical and datetime Python values. My question is: is there a way to apply the same conversions to a DataFrame whose columns contain string data, when the DataFrame does not come from a CSV file but from a different (unparsed) source? Only an in-memory DataFrame object is available.
Saving such a DataFrame to a CSV file and reading it back would perform the conversion, but this looks very inefficient to me.

If you have, for example, a column of string type that actually contains dates
(e.g. yyyy-mm-dd), you can use pd.to_datetime() to convert it to Timestamp values.
Assuming that the column name is SomeDate, you can call:
df.SomeDate = pd.to_datetime(df.SomeDate)
Another option is to apply your own conversion function to any column
(see the pandas documentation for apply).
You didn't give any details, so I can only give such general advice.
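As a rough illustration of the advice above, the sketch below converts string columns of an in-memory DataFrame without touching the disk. The column names and sample values are made up for the example, and which columns to convert is an assumption you would adapt to your data. The last step shows a round trip through an in-memory buffer, which reuses the inference of read_csv itself.

```python
import io
import pandas as pd

# Hypothetical unparsed data: every column holds strings
raw = pd.DataFrame({
    "SomeDate": ["2020-07-12", "2020-08-18"],
    "Count": ["10", "25"],
    "Price": ["500.15", "700.40"],
    "Name": ["alpha", "beta"],
})

df = raw.copy()

# Explicit conversion of a column known to hold dates
df["SomeDate"] = pd.to_datetime(df["SomeDate"], format="%Y-%m-%d")

# Try a numeric conversion on the remaining columns;
# a column that is not fully numeric is left unchanged
for col in ["Count", "Price", "Name"]:
    try:
        df[col] = pd.to_numeric(df[col])
    except (ValueError, TypeError):
        pass

print(df.dtypes)

# Alternative: let read_csv do the inference via an in-memory buffer
# (dates stay strings unless parse_dates is passed)
roundtrip = pd.read_csv(io.StringIO(raw.to_csv(index=False)))
print(roundtrip.dtypes)
```

The explicit route gives control over each column, while the buffer round trip applies the default read_csv rules to all columns at once at the cost of serializing the data.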

Related

I have a CSV generated from CICFLOWMETER and I'm unable to generate a correlation matrix. It either generates an empty data frame

This is the code I'm using. I also tried converting the data type of my columns from object to float, but I got the error shown below the code:
df = pd.read_csv('DDOSping.csv')
pearsoncorr = df.corr(method='pearson')
ValueError: could not convert string to float:
'172.27.224.251-172.27.224.250-56003-502-6'

Somewhere in your CSV this string value exists '172.27.224.251-172.27.224.250-56003-502-6'. Do you know why it's there? What does it represent? It looks to me like it shouldn't be in the data you include in your correlation matrix calculation.
The df.corr method is trying to convert the string value to a float, but it's obviously not possible to do because it's a big complicated string with various characters, not a regular number.
You should clean your CSV of unnecessary data (or make a copy and clean that so you don't lose anything important). Remove anything, like metadata, that isn't the exact data that df.corr needs, including the string in the error message.
If it's just a few values you need to clean, open the file in Excel or a text editor and clean it there. If it's a lot, and all the irrelevant data sits in specific rows and/or columns, you could remove those from your DataFrame before calling df.corr instead of cleaning the file itself.
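One way to drop the non-numeric columns before computing the correlation is sketched below. The DataFrame is a made-up stand-in for the CSV: the column names and all values except the first identifier string are assumptions, not taken from the actual file.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: one identifier column holding
# strings, plus numeric feature columns
df = pd.DataFrame({
    "Flow ID": [
        "172.27.224.251-172.27.224.250-56003-502-6",
        "172.27.224.250-172.27.224.251-502-56003-6",
        "172.27.224.251-172.27.224.250-56004-502-6",
    ],
    "Flow Duration": [120, 340, 560],
    "Tot Fwd Pkts": [2, 4, 6],
})

# Keep only numeric columns, then compute the correlation matrix
numeric = df.select_dtypes(include="number")
pearsoncorr = numeric.corr(method="pearson")
print(pearsoncorr)
```

If the result is still empty, it usually means no column was read as numeric at all, which df.dtypes would show.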

How to treat date as plain text with pandas?

I use pandas to read a .csv file, then save it as an .xls file. The code is as follows:
import pandas as pd
df = pd.read_csv('filename.csv', encoding='GB18030')
print(df)
df.to_excel('filename.xls')
There's a column that contains dates like '2020/7/12'. It looks like pandas recognized it as a date and automatically output it as '2020-07-12'. I don't want this column, or any other column like it, to be reformatted; I'd like all data to stay the same, as plain text.
This conversion happens in read_csv(), because print(df) already outputs YYYY-MM-DD, before to_excel() is called.
I used df.info() to check the data type of that column; it is object. Then I added the argument dtype=pd.StringDtype() to read_csv(), and it didn't help.
The file contains Chinese characters, so I set the encoding to GB18030; I don't know if this matters.

My experience with pd.read_csv indicates that:
Only columns convertible to int or float are converted to the respective
types by default.
"Date-like" strings are still read as strings (the column type in
the resulting DataFrame is actually object).
If you want read_csv to convert such a column to a datetime type, you
should pass the parse_dates parameter, specifying a list of columns to be
parsed as dates. Since you didn't do that, no source column should be
converted to a datetime type.
To check this, run file.info() after you read the file and check
the type of the column in question.
So if the respective column in the Excel file is of Date type, then the
conversion is probably caused by to_excel.
One more remark concerning variable names:
What you have read using read_csv is a DataFrame, not a file.
The actual file is the source object from which you read the content,
but here you passed only the file name.
So don't use names like file for the resulting DataFrame, as this
is misleading. It is much better to use e.g. df.
Edit following a comment as of 05:58Z
To fully check what you wrote in your comment, I created
the following CSV file:
DateBougth,Id,Value
2020/7/12,1031,500.15
2020/8/18,1032,700.40
2020/10/16,1033,452.17
I ran: df = pd.read_csv('Input.csv') and then print(df), getting:
   DateBougth    Id   Value
0   2020/7/12  1031  500.15
1   2020/8/18  1032  700.40
2  2020/10/16  1033  452.17
So, at the pandas level, no format conversion occurred in the DateBougth
column. Both remaining columns contain numeric content, so they were
silently converted to int64 and float64, but DateBougth remained object.
Then I saved this df to an Excel file, running: df.to_excel('Output.xls')
and opened it with Excel.
No data type conversion took place at the Excel level either.
To see the actual data type of cell B2 (the first DateBougth value),
I clicked on this cell and pressed Ctrl-1 to display the cell formatting.
The format is General (not Date), just as I expected.
Maybe you have an outdated version of the software?
I use Python v. 3.8.2 and Pandas v. 1.0.3.
Another detail to check: look at your code after pd.read_csv.
Maybe somewhere you have an instruction like df.DateBougth = pd.to_datetime(df.DateBougth) (an explicit type conversion)?
Or at least a format conversion. Note that in my environment
there was absolutely no change in the format of the DateBougth column.
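The behaviour described in this answer can be reproduced without any files. The sketch below reads the same sample rows from an in-memory buffer in three ways; using io.StringIO and dtype=str here is my own choice for illustration, not something from the original code.

```python
import io
import pandas as pd

csv_text = (
    "DateBougth,Id,Value\n"
    "2020/7/12,1031,500.15\n"
    "2020/8/18,1032,700.40\n"
)

# Default: numeric columns are converted, date-like strings are not
default = pd.read_csv(io.StringIO(csv_text))

# dtype=str: every column is kept as text, exactly as written in the file
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

# parse_dates: dates are converted only when explicitly requested
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["DateBougth"])

print(default.dtypes)
print(as_text.dtypes)
print(parsed.dtypes)
```

Passing dtype=str is probably the closest match to "keep everything as plain text", since it also preserves trailing zeros such as 700.40.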

Problem solved. I double-checked my .csv file by opening it with Notepad: the data is 2020-07-12, which displays as 2020/7/12 in Office. It turns out that Office reformatted the date to yyyy/m/d (based on the region settings). I'm developing a tool to process and import data into a DB for my company; we used to do this work manually by copy and paste, so no one noticed this issue. Thanks to #Valdi_Bo for his investigation and patience.

reading dat files with pandas by format string

Reading a fixed-width .dat file in pandas is not very complicated using pd.read_csv('file.dat', sep='\s+') or pd.read_fwf('file.dat', widths=[7, ..]). But the file also contains a format string like this:
Format = (i7,1x,i7,1x,i2,1x,i2,1x,i2,1x,f5.1,1x,i4,1x,3i,1x,f4.1,1x,i1,1x,f4.1,1x,i3,1x,i4,1x,i4,1x,i3,1x,i4,2x,i1)
Looking at the column contents, I assume the character indicates the data type (i -> int, f -> float, x -> separator) and the number is obviously the width of the column. Is this a standard notation? Is there a more Pythonic way to read data files by just passing this format string, making scripts safe against format changes in the data file?
I noticed the colspecs argument of the read_fwf() function, but it takes a list of pairs (int, int), not the type of format string that is given.
First rows of the data file:

This is a fairly standard notation: it follows Fortran format descriptors (i for integer, f for float, x for a skipped character) rather than the C printf convention. The format is only really important if you are trying to write the file in an identical manner. For the purpose of reading it all into pandas you don't really need it. If you want control over the specific data type of each column as you read it in, use the dtype parameter. In the example below, column 'a' is read as a 64-bit float and 'b' as a 32-bit int.
import numpy as np
import pandas as pd
my_dtypes = {'a': np.float64, 'b': np.int32}
pd.read_csv('file.dat', sep=r'\s+', dtype=my_dtypes)
You don't have to specify every column, just the ones you care about. It's likely that pandas has already figured out most of this by default, though. After your call to read_csv(), try
df = pd.read_csv(....)
print(df.dtypes)
This will show you the data type of each of your columns.
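If you do want to drive the reader from the format string, one possible approach is to translate it into the colspecs list that read_fwf accepts. The sketch below is only that, a sketch: fortran_colspecs is a hypothetical helper, it understands only the i, f and x descriptors with an optional repeat count, and it deliberately rejects a descriptor without a width (such as the 3i in the question, whose intended meaning I can only guess). The sample format, data and column names are made up.

```python
import io
import re
import pandas as pd

# Optional repeat count, descriptor letter, optional width and decimals,
# e.g. "i7", "2i3", "f5.1", "1x"
_TOKEN = re.compile(r"^(\d*)([ifx])(\d*)(?:\.(\d+))?$", re.IGNORECASE)


def fortran_colspecs(fmt):
    """Translate a format like '(i7,1x,f5.1)' into (colspecs, kinds)."""
    colspecs, kinds, pos = [], [], 0
    for token in fmt.strip().strip("()").split(","):
        match = _TOKEN.match(token.strip())
        if match is None:
            raise ValueError(f"unsupported descriptor: {token!r}")
        count, kind, width, _decimals = match.groups()
        kind = kind.lower()
        if kind == "x":
            # nX skips n characters
            pos += int(count or 1)
            continue
        if not width:
            raise ValueError(f"descriptor without width: {token!r}")
        for _ in range(int(count or 1)):
            colspecs.append((pos, pos + int(width)))
            kinds.append(kind)
            pos += int(width)
    return colspecs, kinds


# Made-up format and data for illustration
fmt = "(i4,1x,i2,1x,f5.1)"
data = "2020 07  12.5\n2021 11   3.0\n"

colspecs, kinds = fortran_colspecs(fmt)
names = ["year", "month", "value"]
df = pd.read_fwf(io.StringIO(data), colspecs=colspecs, header=None, names=names)

# Enforce the types announced by the format string
dtypes = {n: ("int64" if k == "i" else "float64") for n, k in zip(names, kinds)}
df = df.astype(dtypes)
print(df.dtypes)
```

Note that astype to int64 fails on columns with missing values, so a real file with gaps would need a nullable integer type or float columns instead.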

Pandas - Appending 'table' format to HDF5Store with different dtypes: invalid combinate of [values_axes]

I recently started trying to use the HDF5 format in Python pandas to store data, but encountered a problem I can't find a workaround for. Before, I worked with CSV files and had no trouble appending new data.
This is what I tried:
store = pd.HDFStore('cdw.h5')
frame.to_hdf('cdw.h5','cdw/data_cleaned', format='table',append=True, data_columns=True,dropna=False)
And it throws:
ValueError: invalid combinate of [values_axes] on appending data [name->Ordereingangsdatum,cname->Ordereingangsdatum,dtype->float64,kind->float,shape->(1, 176345)] vs current table [name->Ordereingangsdatum,cname->Ordereingangsdatum,dtype->bytes128,kind->string,shape->None]
I understand that it is telling me I am trying to append a different data type for a column, but what baffles me is that I have previously written the same CSV file, along with some other CSV files, from a DataFrame to that HDF5 file.
I'm doing analysis in the forwarding industry, and the data there is very inconsistent: more often than not there are missing values, mixed dtypes in columns, or other 'data dirt'.
I'm looking for a way to append data to an HDF5 file no matter what is inside the column, as long as the column names are the same.
It would be ideal to force appending data to an HDF store independent of data types, or to have another simple solution for my problem. The goal is to automate the analysis later on, so I would rather not change data types every time there is a missing value in one of my 62 columns.
Another question within my question:
My file access with read_hdf takes more time than with read_csv; I have around 1.5 million rows with 62 columns. Is this because I have no SSD drive? I have read that file access with read_hdf should be faster.
I am wondering whether I should stick with CSV files or go with HDF5.
Help would be greatly appreciated.

For anyone having the same issue with appending data where the dtype is not guaranteed to stay the same: I finally found a solution. First convert every column to object with
li = list(frame)
frame[li] = frame[li].astype(object)
frame.info()
Then try df.to_hdf(key,value, append=True) and wait for its error message. The error message TypeError: Cannot serialize the column [not_one_datatype] because its data contents are [mixed] object dtype will tell you which columns it still doesn't like. Converting those columns to float worked for me. After the error, convert the mentioned column with df['not_one_datatype'] = df['not_one_datatype'].astype(float). Only use integer if you are sure that a float will never occur in this column; otherwise the append will fail again.
I decided to work with CSV and HDF5 files in parallel. If I hit a problem with HDF5 that I have no workaround for, I will simply switch to CSV; this is what I can personally recommend.
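The mismatch in the question arises when the same column arrives as strings in one batch and as float64 (for example, all missing) in another. A minimal sketch of forcing every batch into one fixed schema before appending is shown below. coerce_to_schema and the weight column are hypothetical, the to_hdf call itself is left out, and whether these dtypes suit your store is an assumption to verify.

```python
import pandas as pd


def coerce_to_schema(frame, float_cols):
    """Return a copy where float_cols are float64 and all other columns are str."""
    out = frame.copy()
    for col in out.columns:
        if col in float_cols:
            # Unparseable entries become NaN instead of raising
            out[col] = pd.to_numeric(out[col], errors="coerce").astype("float64")
        else:
            # Missing values become empty strings so the column is uniformly text
            as_obj = out[col].astype(object)
            out[col] = as_obj.where(as_obj.notna(), "").astype(str)
    return out


# Two made-up batches whose raw dtypes disagree
batch1 = pd.DataFrame({"Ordereingangsdatum": ["20200712", None],
                       "weight": ["12.5", "n/a"]})
batch2 = pd.DataFrame({"Ordereingangsdatum": [float("nan")],
                       "weight": [3]})

clean1 = coerce_to_schema(batch1, float_cols={"weight"})
clean2 = coerce_to_schema(batch2, float_cols={"weight"})
print(clean1.dtypes)
print(clean2.dtypes)
```

With both batches coerced the same way, each column has the same dtype on every append, which is the condition the error message complains about.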
Update: It seems that the creators of this format did not think about real-world use when designing the HDF API. The HDF5 min_itemsize error, ValueError: Trying to store a string with len [##] in [y] column but this column has a limit of [##]!, occurs when trying to append data to an already existing file if a string in some column is longer than any string in that column at the initial write to the HDF file.
The joke here is that the creators of this API expect me to know the maximum length of every possible value in a column at the first write. Really? Another inconsistency is that df.to_hdf(append=True) does not have the parameter min_itemsize={'column1':1000}. This format is at best suited for storing self-created data only, but definitely not for data where the dtypes and the lengths of the entries in each column are NOT set in stone. The only solution left when you want to append data from pandas DataFrames regardless of the stubborn HDF5 API in Python is to insert into every DataFrame, before appending, a row with very long strings in all non-numeric columns, just to be sure that you will always be able to append the data no matter how long it gets.
Doing this makes the write process take ages and consume a huge amount of disk space for the resulting HDF5 file.
CSV definitely wins against HDF5 in terms of performance, integration and especially usability.

Specifying data type in Pandas csv reader

I am just getting started with Pandas and I am reading in a csv file using the read_csv() method. The difficulty I am having is preventing pandas from converting my telephone numbers to large numbers, instead of keeping them as strings. I defined a converter which just left the numbers alone, but then they still converted to numbers. When I changed my converter to prepend a 'z' to the phone numbers, then they stayed strings. Is there some way to keep them strings without modifying the values of the fields?

Since Pandas 0.11.0 you can use dtype argument to explicitly specify data type for each column:
d = pandas.read_csv('foo.csv', dtype={'BAR': 'S10'})

It looks like you can't prevent pandas from trying to convert numeric/boolean values in the CSV file. Take a look at the pandas source code for the IO parsers, in particular the functions _convert_to_ndarrays and _convert_types.
https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py
You can always assign the type you want after you have read the file:
df.phone = df.phone.astype(str)
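A small sketch of the difference between the two approaches above, using made-up names and phone numbers read from an in-memory buffer. Converting after the read cannot restore leading zeros that were already dropped, while passing dtype at read time keeps the text exactly as written.

```python
import io
import pandas as pd

# Made-up sample data with leading zeros in the phone column
csv_text = "name,phone\nAlice,0015550100\nBob,0015550101\n"

# Default inference turns the phone numbers into integers
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Converting afterwards yields strings, but the leading zeros are gone
converted_later = as_numbers["phone"].astype(str)

# Declaring the column as str at read time keeps the original text
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"phone": str})

print(as_numbers["phone"].tolist())
print(converted_later.tolist())
print(as_text["phone"].tolist())
```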
