I have a CSV file with a column that holds dictionaries (column 'b' in the example below). But 'b' in df comes back as a string, even though I specify the dictionary type when reading. I haven't found a solution to this. Any suggestions?
import pandas as pd

a = pd.DataFrame({'a': [1, 2, 3], 'b': [{'a': 3, 'haha': 4}, {'c': 3}, {'d': 4}]})
a.to_csv('tmp.csv', index=None)
df = pd.read_csv('tmp.csv', dtype={'b': dict})
I wonder if your CSV column is actually meant to be a Python dict column, or rather JSON. If it's JSON, you can read the column as dtype=str, then use json_normalize() on that column to explode it into multiple columns. This is an efficient solution assuming the column contains valid JSON.
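For example, a minimal sketch of that approach, assuming the cells in 'b' actually hold valid JSON objects (double-quoted keys; the Python-dict repr written by to_csv above would not parse as JSON):
import json
import pandas as pd

df = pd.read_csv('tmp.csv', dtype={'b': str})
# Parse each JSON cell, then explode the dicts into one column per key
expanded = pd.json_normalize(df['b'].apply(json.loads).tolist())
df = df.drop(columns='b').join(expanded)
print(df)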
There is no dictionary dtype in pandas, so you should specify object if you want plain Python objects:
df = pd.read_csv('tmp.csv', dtype={'b':object})
This column will still contain strings, because pandas doesn't know what dictionaries are. If you want dictionaries back, you can "eval" them with ast.literal_eval (safe evaluation of Python literals):
import ast

df['b'] = df['b'].apply(ast.literal_eval)
print(df['b'][0]['a']) # 3
If you're really confident that you will never run this on untrusted CSVs, you could also use eval. But before you consider that, I would recommend trying to use a DataFrame only with "native" pandas or NumPy types (or maybe a DataFrame-in-DataFrame approach); it's best to avoid object dtypes as much as possible, as in the sketch below.
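A minimal sketch of keeping native dtypes, assuming the cells are Python-dict literals as in the question: parse them once, then expand the dicts into separate numeric columns instead of storing dict objects.
import ast
import pandas as pd

df = pd.read_csv('tmp.csv')
parsed = df['b'].apply(ast.literal_eval)
# One column per key; rows missing a key get NaN, values stay numeric
expanded = pd.DataFrame(parsed.tolist())
df = df.drop(columns='b').join(expanded)
print(df.dtypes)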
You could use the converters parameter.
From the documentation:
converters : dict, optional
Dict of functions for converting values in certain columns. Keys can
either be integers or column labels.
If you know your column is well formed and has no missing values, then you could do:
import ast

df = pd.read_csv('tmp.csv', converters={'b': ast.literal_eval})
However, for safety (as others have commented) you should probably define your own function with some basic error resilience:
def to_dict(x):
    try:
        y = ast.literal_eval(x)
        if type(y) == dict:
            return y
    except:
        return None
and then:
df = pd.read_csv('tmp.csv', converters = {'b': to_dict})
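As a quick check on the example file above, every parsed cell comes back as a dict, while anything the converter cannot parse would come back as None:
print(df['b'].map(type).tolist())   # [<class 'dict'>, <class 'dict'>, <class 'dict'>]
print(df['b'][0].get('haha'))       # 4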
Related
I have a pandas dataframe where one column contains sets of strings (each row is a single set of strings). However, when I save this dataframe to CSV and read it back into a pandas dataframe later, each set of strings in this particular column seems to have been saved as a single string. For example, the value in a particular row should be a single set of strings, but it seems to have been read back in as one string.
I need to access this data as a python set of strings, is there a way to turn this back into a set? Or better yet, have pandas read this back in as a set?
Wrapping the string in set() would only give you a set of its individual characters. To get the original set of strings back, parse the literal instead, for example with ast.literal_eval:
import ast

string = "{'+-0-', '0---', '+0+-', '0-0-', '++++', '+++0', '+++-', '+---', '0+++', '0++0', '0+00', '+-+-', '000-', '+00-'}"
new_set = ast.literal_eval(string)
I think you could use a different separator when converting the dataframe to CSV.
import pandas as pd
df = pd.DataFrame(["{'Ramesh','Suresh','Sachin','Venkat'}"],columns=['set'])
print('Old df \n', df)
df.to_csv('mycsv.csv', sep= ';', index=False)
new_df = pd.read_csv('mycsv.csv', sep= ';')
print('New df \n',new_df)
Output:
You can use Series.apply, I think:
Let's say your column of sets was called column_of_sets. Assuming you've already read the csv, now do this to convert back to sets.
df['column_of_sets'] = df['column_of_sets'].apply(eval)
I'm taking eval from @Cabara's comment; I think it is the best bet.
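If the CSV might ever come from an untrusted source, a safer variant of the same idea (assuming the cells really are set literals) is ast.literal_eval:
import ast

df['column_of_sets'] = df['column_of_sets'].apply(ast.literal_eval)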
When reading delimited data from a file, the pandas library is able to interpret the types of the data columns.
When passing a pandas dataframe a list of lists of strings assembled through some process outside of pandas, pandas preserves the inner list types as strings:
import numpy
from pandas import DataFrame

data = [['1', '2'], ['3', '4']]
cols = ['foo', 'biz']
df = DataFrame(columns=cols, data=data)
print(numpy.sum(df.values))
$: <literal sum of the strings>
Is there a way to trigger pandas' type-interpretation logic on data generated within the running program?
As far as I am aware, that is a feature of the CSV parser pandas uses. You can enforce a single dtype using the dtype argument to the DataFrame constructor, or alternatively, as a post-processing step, you could do:
df.apply(lambda S: pd.to_numeric(S, errors='ignore'))
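Applied to the data above, this converts both columns to integers, so the sum becomes numeric (errors='ignore' leaves non-convertible columns untouched, though note that newer pandas versions deprecate that option):
import numpy
import pandas as pd

data = [['1', '2'], ['3', '4']]
df = pd.DataFrame(columns=['foo', 'biz'], data=data)
df = df.apply(lambda s: pd.to_numeric(s, errors='ignore'))
print(df.dtypes)             # foo and biz are now int64
print(numpy.sum(df.values))  # 10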
Try this:
df = pd.DataFrame(columns=cols, data=data, dtype=float)
I have a CSV with over 4000 rows, in which each cell contains a tuple holding a specific coordinate. I would like to convert it to a NumPy array to work with. I use pandas to convert it into a DataFrame before calling df.values. However, after calling df.values, each tuple becomes a string "(x,y)" instead. Is it possible to prevent this from happening? Thank you.
df = pd.read_csv(sample_data)
array = df.values
I think the problem is that from CSV you always get the tuples back as strings.
So you need to convert them:
import ast
df['col'] = df['col'].apply(ast.literal_eval)
Or if all columns are tuples:
df = df.applymap(ast.literal_eval)
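To then get the NumPy coordinate array the question asks for, a sketch assuming a single column 'col' of (x, y) tuples:
import numpy as np

coords = np.array(df['col'].tolist())   # shape (n_rows, 2), numeric dtype
print(coords.shape)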
It seems that you read the file from a local path?
My answer is to use eval to change the strings:
df.apply(lambda x: x.apply(eval))
Note that simply calling df['col'].apply(tuple) after reading the CSV does not work here: applying tuple directly to the string produces a tuple of its characters, so the string has to be parsed first (with eval or ast.literal_eval as above).
I have two dataframes:
import pandas as pd
from quantlib.time.date import Date
cols = ['ColStr','ColDate']
dataset1 = [['A',Date(2017,1,1)],['B',Date(2017,2,2)]]
x = pd.DataFrame(dataset1,columns=cols)
dataset2 = [['A','2017-01-01'],['B','2017-02-04']]
y = pd.DataFrame(dataset2,columns=cols)
Now, I want to compare the two tables. I have written another set of code that compares two (larger) dataframes and works for strings and numerical values.
My problem is that, with the 'ColDate' column being a string in one dataframe and a Date in the other, I am not able to validate that 'ColStr' = 'A' is a match and 'ColStr' = 'B' is a mismatch.
I would have to
(1) either convert y.ColDate to Date,
(2) or convert x.ColDate to str with the same format as y.ColDate.
How do I achieve one or the other?
I guess that you need to cast them to a single common type using something like dataset1['ColDate'] = dataset1.ColDate.map(convert_type) or any other method to iterate column values. Check other functions from pandas docs like apply().
The convert_type function should be defined in your program and accept a single argument to be passed into map().
And, when the columns have same types, you can compare them using any method you like.
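A minimal sketch of such a convert_type, assuming the QuantLib Date object exposes year/month/day attributes (adjust the attribute access to whatever your binding actually provides):
def convert_type(d):
    # Hypothetical helper: render a QuantLib Date as an ISO 'YYYY-MM-DD' string
    return '%04d-%02d-%02d' % (d.year, d.month, d.day)

x['ColDate'] = x['ColDate'].map(convert_type)
print(x['ColDate'] == y['ColDate'])   # True for 'A', False for 'B'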
You probably want to use the dt.strftime() function; note that it only works on a pandas datetime64 column (e.g. after pd.to_datetime):
x['ColDate'].dt.strftime("%Y-%m-%d")
I'm using df.columns.values to make a list of column names which I then iterate over and make charts, etc... but when I set this up I overlooked the non-numeric columns in the df. Now, I'd much rather not simply drop those columns from the df (or a copy of it). Instead, I would like to find a slick way to eliminate them from the list of column names.
Now I have:
names = df.columns.values
what I'd like to get to is something that behaves like:
names = df.columns.values(column_type=float64)
Is there any slick way to do this? I suppose I could make a copy of the df, and drop those non-numeric columns before doing columns.values, but that strikes me as clunky.
Welcome any inputs/suggestions. Thanks.
Someone will possibly give you a better answer than this, but one thing I tend to do is this: if all my numeric data are int64 or float64 objects, I create a dict of the column data types and then use its values to build my list of columns.
So for example, in a dataframe where I have columns of type float64, int64 and object firstly you can look at the data types as so:
DF.dtypes
and if they conform to the standard whereby the non-numeric columns of data are all object types (as they are in my dataframes), then you can do the following to get a list of the numeric columns:
[key for key in dict(DF.dtypes) if dict(DF.dtypes)[key] in ['float64', 'int64']]
It's just a simple list comprehension, nothing fancy. Again, though, whether this works for you will depend upon how you set up your dataframe...
dtypes is a Pandas Series.
That means it contains index & values attributes.
If you only need the column names:
headers = df.dtypes.index
It returns an Index object containing the column names of the "df" dataframe.
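Since dtypes is a Series, you can also filter it directly to keep only the numeric column names, for example:
numeric_cols = df.dtypes[(df.dtypes == 'float64') | (df.dtypes == 'int64')].index.tolist()
print(numeric_cols)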
There's a new feature in 0.14.1, select_dtypes, which selects columns by dtype, given a list of dtypes to include or exclude.
For example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(1000),
                   'b': range(1000),
                   'c': ['a'] * 1000,
                   'd': pd.date_range('2000-1-1', periods=1000)})
df.select_dtypes(['float64', 'int64'])
Out[129]:
a b
0 0.153070 0
1 0.887256 1
2 -1.456037 2
3 -1.147014 3
...
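In more recent pandas versions you can also pass the generic 'number' selector instead of listing each numeric dtype, and exclude works the same way:
df.select_dtypes(include='number')     # all numeric columns
df.select_dtypes(exclude=['object'])   # everything except object columns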
To get the column names from a pandas dataframe in Python 3:
Here I am creating a dataframe from a fileName.csv file.
>>> import pandas as pd
>>> df = pd.read_csv('fileName.csv')
>>> columnNames = list(df.head(0))
>>> print(columnNames)
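Equivalently, the columns attribute gives the same names directly:
>>> columnNames = df.columns.tolist()
>>> print(columnNames)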
You can also get the column names from a pandas dataframe along with their dtypes. Here I'll read a CSV file from https://mlearn.ics.uci.edu/databases/autos/imports-85.data, but you have to define a header containing the column names yourself:
import pandas as pd

url = "https://mlearn.ics.uci.edu/databases/autos/imports-85.data"
df = pd.read_csv(url, header=None)
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration", "num-of-doors",
           "body-style", "drive-wheels", "engine-location", "wheel-base", "length", "width",
           "height", "curb-weight", "engine-type", "num-of-cylinders", "engine-size", "fuel-system",
           "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm", "city-mpg",
           "highway-mpg", "price"]
df.columns = headers
print(df.columns)   # column names
print(df.dtypes)    # column names together with their dtypes