I have a mock-up dataframe below that closely resembles my original dataframe:
sof = pd.DataFrame({'id':['1580326032400705442181105','15803260000063243713608360','1580326343500677412104013','15803260343000000705432103406'],'class':['a','c','c','d']})
When I write this dataframe to my desktop using the 'to_csv' function, I see the ids automatically converted to scientific format (example: 1.5803260324007E+24).
I have a few questions on this
Why does Python convert this column (obviously of type 'object') to a numeric format?
How do I preserve my format?
I have tried the following
sof.to_csv('path',float_format='%f',index = False)
It doesn't seem to change anything.
sof['id'].astype(int).astype(str)
Here I tried to convert the supposed "float" to int and then to string.
It gives the following error: OverflowError: Python int too large to convert to C long
Can I get some guidance on how this can be achieved?
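One way to check where the scientific notation comes from is to look at the raw text that to_csv produces. The sketch below writes to an in-memory buffer instead of a file (an assumption made only so that it runs anywhere) and uses a shortened copy of the mock-up data. It suggests the ids are written verbatim as long as the column holds strings; if that is the case, the conversion is more likely happening in whatever program opens the file, such as a spreadsheet application. Reading the data back with dtype=str keeps pandas itself from reinterpreting the ids.

```python
import io

import pandas as pd

# Shortened copy of the mock-up data; 'id' holds strings, not numbers.
sof = pd.DataFrame({
    'id': ['1580326032400705442181105', '15803260000063243713608360'],
    'class': ['a', 'c'],
})

# Write to an in-memory buffer so the raw CSV text can be inspected.
buffer = io.StringIO()
sof.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back, telling pandas to treat 'id' as text rather than a number.
round_trip = pd.read_csv(io.StringIO(csv_text), dtype={'id': str})
print(round_trip['id'].tolist())
```

If the printed text shows the full digits, opening the file in a plain text editor rather than a spreadsheet is a quick way to confirm the same thing for the file on disk.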
This is the code I'm using. I have also tried converting the datatype of my columns from object to float, but I got this error:
df = pd.read_csv('DDOSping.csv')
pearsoncorr = df.corr(method='pearson')
ValueError: could not convert string to float: '172.27.224.251-172.27.224.250-56003-502-6'
Somewhere in your CSV this string value exists '172.27.224.251-172.27.224.250-56003-502-6'. Do you know why it's there? What does it represent? It looks to me like it shouldn't be in the data you include in your correlation matrix calculation.
The df.corr method is trying to convert the string value to a float, but it's obviously not possible to do because it's a big complicated string with various characters, not a regular number.
You should clean your CSV of unnecessary data (or make a copy and clean that so you don't lose anything important). Remove anything, like metadata, that isn't the exact data that df.corr needs, including the string in the error message.
If it's just a few values you need to clean, open the file in Excel or a text editor and clean it there. If it's a lot, and all the irrelevant data is in specific rows and/or columns, you could remove those from your DataFrame before calling 'df.corr' instead of cleaning the file itself.
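Dropping the non-numeric columns before calling df.corr could look like the sketch below. The column names and values are hypothetical stand-ins for the CSV, which is not shown in the question; select_dtypes(include='number') keeps only the numeric columns.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: one identifier column plus numeric columns.
df = pd.DataFrame({
    'flow_id': ['172.27.224.251-172.27.224.250-56003-502-6',
                '172.27.224.250-172.27.224.251-502-56003-6',
                '172.27.224.251-172.27.224.250-56004-502-6'],
    'duration': [1.0, 2.0, 4.0],
    'packets': [10, 19, 42],
})

# Keep only numeric columns, then compute the correlation matrix on those.
numeric = df.select_dtypes(include='number')
pearsoncorr = numeric.corr(method='pearson')
print(pearsoncorr)
```

The original DataFrame is left untouched, so the identifier column is still available for anything else that needs it.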
read_csv contains a lot of parsing logic to detect and convert CSV strings to numerical and datetime Python values. My question is: is there a way to apply the same conversions to a DataFrame that contains columns with string data, but where the DataFrame is not stored in a CSV file and instead comes from a different (unparsed) source? Only an in-memory DataFrame object is available.
Saving such a DataFrame to a CSV file and reading it back would perform the conversion, but this looks very inefficient to me.
If you have, for example, a column of string type that actually contains dates (e.g. yyyy-mm-dd), you can use pd.to_datetime() to convert it to Timestamp values.
Assuming that the column name is SomeDate, you can call:
df.SomeDate = pd.to_datetime(df.SomeDate)
Another option is to apply your own conversion function to any column (search the web for a description of apply).
You didn't give any details, so I can only give this very general advice.
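As a minimal sketch of converting string columns in memory, the example below uses pd.to_datetime for a date column and pd.to_numeric for number-like columns. The frame and its column names other than SomeDate are hypothetical, since the question gives no details about the data.

```python
import pandas as pd

# Hypothetical frame whose values all arrived as strings.
df = pd.DataFrame({
    'SomeDate': ['2020-01-31', '2020-02-29'],
    'count': ['3', '14'],
    'ratio': ['0.5', '2.25'],
})

# Convert each column explicitly, without a round trip through a CSV file.
df['SomeDate'] = pd.to_datetime(df['SomeDate'], format='%Y-%m-%d')
df['count'] = pd.to_numeric(df['count'])
df['ratio'] = pd.to_numeric(df['ratio'])

print(df.dtypes)
```

Unlike read_csv, this does not guess which columns hold dates or numbers; each conversion is requested per column.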
Reading a fixed-width .dat file in pandas is not very complicated using pd.read_csv('file.dat', sep='\s+') or pd.read_fwf('file.dat', widths=[7, ..]). But the file also contains a format string like this:
Format = (i7,1x,i7,1x,i2,1x,i2,1x,i2,1x,f5.1,1x,i4,1x,3i,1x,f4.1,1x,i1,1x,f4.1,1x,i3,1x,i4,1x,i4,1x,i3,1x,i4,2x,i1)
Looking at the column contents, I assume the character indicates the datatype (i -> int, f -> float, x -> separator) and the number is the width of the column. Is this a standard notation? Is there a more pythonic way to read data files by just passing this format string, making scripts safe against format changes in the data file?
I noticed the colspecs argument of the read_fwf() function, but it takes a list of pairs (int, int), not the type of format string that is given.
First rows of the data file:
This notation looks like a Fortran-style format specification (i for integer, f for float, x for a skipped character), which is a fairly standard way to describe fixed-width records. The format is only really important if you are trying to write the file in an identical manner. For the purpose of reading it all into pandas you don't really need it. If you want control over the specific data type of each column as you read it in, use the dtype parameter. In the example below, column 'a' is read as a 64-bit float and 'b' as a 32-bit int.
import numpy as np
import pandas as pd
my_dtypes = {'a': np.float64, 'b': np.int32}
pd.read_csv('file.dat', sep=r'\s+', dtype=my_dtypes)
You don't have to specify every column, just the ones that you want. It's likely that pandas figured out most of this already though by default. After your call to read_csv() try
df = pd.read_csv(....)
print(df.dtypes)
This will show you the data type of each of your columns.
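The question's idea of deriving column positions from the format string could be sketched as follows. pandas has no option that accepts such a string, so parse_format is a hypothetical helper, and it only understands three token shapes: iN, fN.d and Nx. Anything else (repeat counts, or a token like 3i) raises an error rather than being guessed at. The format string and sample rows are shortened, made-up examples.

```python
import io
import re

import pandas as pd


def parse_format(fmt):
    """Turn a format string such as '(i7,1x,f5.1)' into colspecs and types."""
    colspecs, kinds = [], []
    pos = 0
    for token in fmt.strip().strip('()').split(','):
        token = token.strip().lower()
        skip = re.fullmatch(r'(\d+)x', token)
        field = re.fullmatch(r'([if])(\d+)(?:\.\d+)?', token)
        if skip:
            # 'Nx' means N characters to skip; no column is created.
            pos += int(skip.group(1))
        elif field:
            width = int(field.group(2))
            colspecs.append((pos, pos + width))
            kinds.append(int if field.group(1) == 'i' else float)
            pos += width
        else:
            raise ValueError(f'unsupported format token: {token!r}')
    return colspecs, kinds


fmt = '(i7,1x,i2,1x,f5.1)'  # shortened, hypothetical format string
colspecs, kinds = parse_format(fmt)

# Build two sample rows that follow the same layout.
sample = f'{1234:7d} {12:2d} {98.6:5.1f}\n{56:7d} {7:2d} {0.5:5.1f}\n'

names = [f'col{i}' for i in range(len(colspecs))]
df = pd.read_fwf(io.StringIO(sample), colspecs=colspecs, header=None, names=names)
df = df.astype(dict(zip(names, kinds)))
print(df)
print(df.dtypes)
```

Because the column positions come from the format string, a change to the widths in the file header would carry through without editing the script, provided the new format still uses only the supported tokens.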
I am working with a pandas DataFrame that has complex numbers as column data. I am trying to export this DataFrame to Excel using the DataFrame.to_excel method, which throws the following error:
raise ValueError("Cannot convert {0} to Excel".format(value))
ValueError: Cannot convert (1.044574-3496.069365j) to Excel
Is there any workaround for this? My DataFrame looks like this:
Freq lne_10720_15820_1-lne_10720_18229_1 lne_10720_15820_1 \
48 (1.044574-3496.069365j) (7.576632+64.778558j)
50 (1.049333-3355.448147j) (7.557604+67.544162j)
52 (1.054253-3225.613165j) (7.656567+70.317672j)
A simple solution to your problem is to cast the values in the DataFrame to strings, as follows (assuming the DataFrame is named df):
df = df.astype(str)
After this operation, every column holds strings instead of complex numbers, which can be verified using df.dtypes.
Hope this helps :)
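A small sketch of the string cast, plus one alternative that is not part of the answer above: splitting each complex column into real and imaginary parts so the values stay numeric in Excel. The column names z, z_real and z_imag are hypothetical, and the to_excel call is left commented out because it needs an Excel writer engine such as openpyxl to be installed.

```python
import pandas as pd

# Hypothetical frame with one complex-valued column.
df = pd.DataFrame({
    'Freq': [48, 50],
    'z': [1.044574 - 3496.069365j, 1.049333 - 3355.448147j],
})

# Option 1: cast the complex column to text, as suggested above.
as_text = df.astype({'z': str})

# Option 2 (an alternative, not from the answer): keep numbers by
# splitting the column into real and imaginary parts.
values = df['z'].to_numpy()
as_parts = pd.DataFrame({
    'Freq': df['Freq'],
    'z_real': values.real,
    'z_imag': values.imag,
})
print(as_parts)

# Either frame can then be exported, for example:
# as_parts.to_excel('out.xlsx', index=False)
```

With the string cast, Excel receives text cells; the split version keeps plain floats that can be used in formulas.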
I am just getting started with pandas and I am reading in a CSV file using the read_csv() method. The difficulty I am having is preventing pandas from converting my telephone numbers to large numbers instead of keeping them as strings. I defined a converter which just left the numbers alone, but they were still converted to numbers. When I changed my converter to prepend a 'z' to the phone numbers, they stayed strings. Is there some way to keep them as strings without modifying the values of the fields?
Since pandas 0.11.0 you can use the dtype argument to explicitly specify the data type for each column:
d = pandas.read_csv('foo.csv', dtype={'BAR': 'S10'})
It looks like you can't prevent pandas from trying to convert numeric/boolean values in the CSV file. Take a look at the pandas source code for the IO parsers, in particular the functions _convert_to_ndarrays and _convert_types.
https://github.com/pydata/pandas/blob/master/pandas/io/parsers.py
You can always assign the type you want after you have read the file:
df.phone = df.phone.astype(str)
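The difference between the two approaches shows up with leading zeros. The sketch below uses a made-up two-row CSV held in memory: passing dtype at read time keeps the text exactly as written, while casting after the read operates on values that have already been parsed as integers.

```python
import io

import pandas as pd

# Made-up CSV content; the first phone number starts with a zero.
csv_text = 'name,phone\nAnn,0123456789\nBob,5551234567\n'

# Declare the column as text while reading: values stay exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'phone': str})
print(as_text['phone'].tolist())

# Cast after reading: pandas has already parsed integers by then,
# so the leading zero is gone before astype(str) runs.
late = pd.read_csv(io.StringIO(csv_text))
late_phone = late['phone'].astype(str)
print(late_phone.tolist())
```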