pandas to_csv converts str column to int(or float) - python

As the title says, I noticed that pandas' to_csv seems to automatically convert columns that contain only numeric strings to int or float.
I am creating a dataframe in a Jupyter notebook with a column ['A'] full of the value '1'. Hence, I have a dataframe composed of one column of the string '1'.
When I write my dataframe to a CSV file with to_csv, the output file is a single column of 1s, which is read back as integers.
You may advise me to convert the column back to string when it is reloaded in Jupyter. However, that won't work because I don't know beforehand which columns may be affected by this behaviour.
Is there a way to avoid this strange situation?

You can set the quoting parameter in to_csv. Take a look at this example:
import csv
import pandas as pd
a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)
df.to_csv('test.csv', sep='\t', quoting=csv.QUOTE_NONNUMERIC)
The created csv file is:
"" 0 1 2
0 "a" "1.2" "4.2"
1 "b" "70" "0.03"
2 "x" "5" "0"
You can also set the quote character with quotechar parameter, e.g. quotechar="'" will produce this output:
'' 0 1 2
0 'a' '1.2' '4.2'
1 'b' '70' '0.03'
2 'x' '5' '0'
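To see what the quoting actually buys you, here is a minimal sketch. The column names A and B are made up for illustration, and the file is written to an in-memory buffer instead of disk. It shows that strings are quoted in the CSV text while numbers are not, and that a reader that honours QUOTE_NONNUMERIC (here the standard library csv.reader) can tell them apart. If you only need every value back as text in pandas, read_csv with dtype=str is a blunt but simple option.

```python
import csv
import io

import pandas as pd

# Hypothetical frame: column A holds the string '1', column B holds real integers.
df = pd.DataFrame({"A": ["1", "1"], "B": [2, 3]})

# Write to an in-memory buffer; strings get quoted, numbers do not.
buf = io.StringIO()
df.to_csv(buf, index=False, quoting=csv.QUOTE_NONNUMERIC)
lines = buf.getvalue().splitlines()
print(lines)

# The stdlib reader keeps quoted fields as str and converts unquoted ones to float.
rows = list(csv.reader(lines, quoting=csv.QUOTE_NONNUMERIC))
print(rows)

# Blunt alternative in pandas: read every column back as text.
as_text = pd.read_csv(io.StringIO(buf.getvalue()), dtype=str)
print(as_text["A"].tolist(), as_text["B"].tolist())
```

Note that the conversion to numbers happens when the file is read, not when it is written: a CSV file itself carries no type information beyond the quotes.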

One way is to store your types separately and load them along with your data:
import pandas as pd
df = pd.DataFrame({0: ['1', '1', '1'],
                   1: [2, 3, 4]})
df.dtypes.to_frame('types').to_csv('types.csv')
df.to_csv('file.csv', index=False)
df_types = pd.read_csv('types.csv')['types']
df = pd.read_csv('file.csv', dtype=df_types.to_dict())
print(df.dtypes)
# 0 object
# 1 int64
# dtype: object
You may also wish to consider pickle, which stores the dataframe together with its dtypes so that it is loaded back unchanged:
df.to_pickle('file.pkl')
df = pd.read_pickle('file.pkl')
print(df.dtypes)
# 0 object
# 1 int64
# dtype: object
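The same idea can be sketched with string column names and a JSON "sidecar" holding the dtypes. The column names A and B and the use of JSON are assumptions made for this illustration, not part of the answer above, and in-memory strings stand in for the two files.

```python
import io
import json

import pandas as pd

# Hypothetical frame: A is text that looks numeric, B is truly numeric.
df = pd.DataFrame({"A": ["1", "1", "1"], "B": [2, 3, 4]})

# Record each column's dtype by name; this is what would go in the sidecar file.
dtypes_json = json.dumps({col: str(dtype) for col, dtype in df.dtypes.items()})

# The data itself goes to CSV as usual.
csv_text = df.to_csv(index=False)

# On reload, hand the recorded dtypes to read_csv so nothing is re-inferred.
restored = pd.read_csv(io.StringIO(csv_text), dtype=json.loads(dtypes_json))
print(restored.dtypes)
print(restored["A"].tolist())
```

This works for dtypes whose string name read_csv accepts (object, int64, float64 and similar); datetime columns would need parse_dates instead.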

Related

Replacing column indexes with the row below [duplicate]

The data I have to work with is a bit messy. It has header names inside its data. How can I choose a row from an existing pandas dataframe and make it (rename it to) a column header?
I want to do something like:
header = df[df['old_header_name1'] == 'new_header_name1']
df.columns = header
In [21]: df = pd.DataFrame([(1,2,3), ('foo','bar','baz'), (4,5,6)])
In [22]: df
Out[22]:
0 1 2
0 1 2 3
1 foo bar baz
2 4 5 6
Set the column labels to equal the values in the 2nd row (index location 1):
In [23]: df.columns = df.iloc[1]
If the index has unique labels, you can drop the 2nd row using:
In [24]: df.drop(df.index[1])
Out[24]:
1 foo bar baz
0 1 2 3
2 4 5 6
If the index is not unique, you could use:
In [133]: df.iloc[pd.RangeIndex(len(df)).drop(1)]
Out[133]:
1 foo bar baz
0 1 2 3
2 4 5 6
Using df.drop(df.index[1]) removes all rows with the same label as the second row. Because non-unique indexes can lead to stumbling blocks (or potential bugs) like this, it's often better to take care that the index is unique (even though Pandas does not require it).
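Putting the two steps together, a minimal sketch of a helper that promotes a row to the header and drops it by position, so it also behaves with a non-unique index. The function name promote_row_to_header is made up for this illustration.

```python
import numpy as np
import pandas as pd


def promote_row_to_header(df, pos):
    """Return a copy of df whose columns are the values of the row at position pos."""
    header = df.iloc[pos].tolist()
    # Boolean mask by position, so duplicate index labels are not a problem.
    keep = np.arange(len(df)) != pos
    out = df.iloc[keep].copy()
    out.columns = header
    return out


# Deliberately non-unique index to show that only one row is removed.
df = pd.DataFrame([(1, 2, 3), ("foo", "bar", "baz"), (4, 5, 6)], index=[0, 0, 0])
result = promote_row_to_header(df, 1)
print(result)
```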
This works (pandas v'0.19.2'):
df.rename(columns=df.iloc[0])
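It would be easier to recreate the data frame.
Note that the new columns will most likely keep the object dtype of df.values; call new_df.infer_objects() afterwards if you want the column types inferred again.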
headers = df.iloc[0]
new_df = pd.DataFrame(df.values[1:], columns=headers)
To rename the header without reassigning df:
df.rename(columns=df.iloc[0], inplace = True)
To drop the row without reassigning df:
df.drop(df.index[0], inplace = True)
You can specify the header row in the read_csv or read_html functions via the header parameter, which represents the "row number(s) to use as the column names, and the start of the data". This has the advantage of automatically dropping all the preceding rows, which are presumably junk.
import pandas as pd
from io import StringIO
In[1]
csv = '''junk1, junk2, junk3, junk4, junk5
junk1, junk2, junk3, junk4, junk5
pears, apples, lemons, plums, other
40, 50, 61, 72, 85
'''
df = pd.read_csv(StringIO(csv), header=2)
print(df)
Out[1]
pears apples lemons plums other
0 40 50 61 72 85
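One detail worth knowing about the example above: because the sample has a space after each comma, the column names keep that leading space unless you ask pandas to skip it. A small sketch with a shortened, made-up two-column version of the sample, using read_csv's skipinitialspace parameter:

```python
import io

import pandas as pd

raw = "junk1, junk2\njunk1, junk2\npears, apples\n40, 50\n"

# Without skipinitialspace, the space after the delimiter becomes part of the name.
raw_names = pd.read_csv(io.StringIO(raw), header=2).columns.tolist()
print(raw_names)

# With skipinitialspace=True, spaces right after the delimiter are dropped.
df = pd.read_csv(io.StringIO(raw), header=2, skipinitialspace=True)
print(df.columns.tolist())
print(df)
```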
Keeping it Python simple
The pandas DataFrame constructor accepts a columns argument, so why not use it with standard Python? It is much clearer what you are doing:
table = [['name', 'Rf', 'Rg', 'Rf,skin', 'CRI'],
['testsala.cxf', '86', '95', '92', '87'],
['testsala.cxf: 727037 lm', '86', '95', '92', '87'],
['630.cxf', '18', '8', '11', '18'],
['Huawei stk-lx1.cxf', '86', '96', '88', '83'],
['dedo uv no filtro.cxf', '52', '93', '48', '58']]
import pandas as pd
data = pd.DataFrame(table[1:],columns=table[0])
or, if the header is not the first row but, for instance, the one at index 10 of a longer table (note that list.pop also removes that row from table):
columns = table.pop(10)
data = pd.DataFrame(table,columns=columns)
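A variant of the same idea that leaves the original list untouched, assuming everything above the header row is junk that should be discarded. The table contents and the header_pos name are made up for illustration.

```python
import pandas as pd

# Hypothetical table: one junk row, then the header, then the data.
table = [
    ["junk", "junk"],
    ["name", "Rf"],
    ["a.cxf", "86"],
    ["b.cxf", "18"],
]

header_pos = 1  # position of the row that holds the column names

# Slice instead of pop, so `table` itself is not modified.
data = pd.DataFrame(table[header_pos + 1:], columns=table[header_pos])
print(data)
```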

Converting Pandas header to string type

I have a dataframe which I read from a CSV as:
df = pd.read_csv(csv_path, header = None)
By default, pandas assigns the header (df.columns) to be [0, 1, 2, ...] of type int64.
What's the best way to convert this to type str, such that df.columns results in ['0', '1', '2', ...] (i.e. type str)?
Currently, the best way I can think of doing this is df.columns = list(map(str, df.columns)).
Unfortunately, df.astype(str) only affects the values and not the column names.
You can use astype(str) with column names like this:
df.columns = df.columns.astype(str)
Example:
In [2472]: l = [1,2]
In [2473]: l1 = [2,3]
In [2475]: df = pd.DataFrame([l, l1])
In [2476]: df
Out[2476]:
0 1
0 1 2
1 2 3
In [2480]: df.columns = df.columns.astype(str)
In [2482]: df.columns
Out[2482]: Index(['0', '1'], dtype='object')
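If you would rather not assign to df.columns, a sketch of an equivalent that returns a new frame: DataFrame.rename accepts a function for columns, so the built-in str can be passed directly.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [2, 3]])  # default integer column labels 0 and 1

# rename applies the callable to every column label and returns a new frame.
renamed = df.rename(columns=str)

print(list(df.columns))       # original labels are still integers
print(list(renamed.columns))  # labels are now strings
print(renamed["0"].tolist())  # string labels can be used for lookup
```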

Pandas - Add Columns to a DataFrame Based in Dict from one of the Columns

I have the pandas.DataFrame below:
One of the columns from the Dataframe, pontos, holds a dict for each of the rows.
What I want to do is add one column to the DataFrame for each key from this dict. So the new columns would be, in this example: rodada, mes, etc, and for each row, these columns would be populated with the respective value from the dict.
So far I've tried the following for one of the keys:
df_times["rodada"] = [df_times["pontos"].get('rodada') for d in df_times["pontos"]]
However, as a result I'm getting a new column rodada filled with None values:
Any hints on what I'm doing wrong?
You can create a new dataframe and concat it to the current one like:
Code:
df2 = pd.concat([df, pd.DataFrame(list(df.pontos))], axis=1)
Test Code:
import pandas as pd
df = pd.DataFrame([
['A', dict(col1='1', col2='2')],
['B', dict(col1='3', col2='4')],
], columns=['X', 'D'])
print(df)
df2 = pd.concat([df, pd.DataFrame(list(df.D))], axis=1)
print(df2)
Results:
X D
0 A {'col2': '2', 'col1': '1'}
1 B {'col2': '4', 'col1': '3'}
X D col1 col2
0 A {'col2': '2', 'col1': '1'} 1 2
1 B {'col2': '4', 'col1': '3'} 3 4
You just need a slight change in your comprehension to extract that data.
It should be:
df_times["rodada"] = [d.get('rodada') for d in df_times["pontos"]]
You want the values under the dictionary key 'rodada' to be the basis of your new column. So you iterate over the dictionaries in the column (d in the comprehension) and extract the value by key from each one. Your original code called .get on the whole column, df_times["pontos"], instead of on d.
You can also use join and the pandas apply function:
df=df.join(df['pontos'].apply(pd.Series))
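One caveat with the concat approach is that the new frame gets a fresh 0..n-1 index, so it only lines up when the original index is the default one. A minimal sketch that keeps the rows aligned by reusing the original index; the sample values and the non-default index are made up, while pontos, rodada and mes are the names from the question.

```python
import pandas as pd

# Hypothetical data with a non-default index to show the alignment issue.
df = pd.DataFrame(
    {
        "X": ["A", "B"],
        "pontos": [{"rodada": 1, "mes": 2}, {"rodada": 3, "mes": 4}],
    },
    index=[10, 20],
)

# Build the expanded columns with the same index as df, then join on it.
expanded = pd.DataFrame(df["pontos"].tolist(), index=df.index)
result = df.join(expanded)
print(result)
```

Building the frame from a list of dicts is usually faster than apply(pd.Series) on large frames, since it avoids creating one Series per row.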

How to read index data as string with pandas.read_csv()?

I'm trying to read a CSV file as a DataFrame with pandas, and I want to read the index column as strings. However, since the index column doesn't contain any non-digit characters, pandas handles this data as integers. How can I read it as strings?
Here are my csv file and code:
[sample.csv]
uid,f1,f2,f3
01,0.1,1,10
02,0.2,2,20
03,0.3,3,30
[code]
df = pd.read_csv('sample.csv', index_col="uid", dtype=float)
print(df.index.values)
The result: df.index is integer, not string:
>>> [1 2 3]
But I want to get df.index as string:
>>> ['01', '02', '03']
And an additional condition: the rest of the columns have to be numeric, and there are too many of them for me to list by name.
Pass the dtype param to specify the dtype:
In [159]:
import pandas as pd
import io
t="""uid,f1,f2,f3
01,0.1,1,10
02,0.2,2,20
03,0.3,3,30"""
df = pd.read_csv(io.StringIO(t), dtype={'uid':str})
df.set_index('uid', inplace=True)
df.index
Out[159]:
Index(['01', '02', '03'], dtype='object', name='uid')
So in your case the following should work:
df = pd.read_csv('sample.csv', dtype={'uid':str})
df.set_index('uid', inplace=True)
The one-line equivalent doesn't work, due to a pandas bug (still outstanding at the time of writing) where the dtype param is ignored on columns that are to be treated as the index:
df = pd.read_csv('sample.csv', dtype={'uid':str}, index_col='uid')
You can dynamically do this if we assume the first column is the index column:
In [171]:
t="""uid,f1,f2,f3
01,0.1,1,10
02,0.2,2,20
03,0.3,3,30"""
cols = pd.read_csv(io.StringIO(t), nrows=1).columns.tolist()
index_col_name = cols[0]
dtypes = dict(zip(cols[1:], [float]* len(cols[1:])))
dtypes[index_col_name] = str
df = pd.read_csv(io.StringIO(t), dtype=dtypes)
df.set_index('uid', inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 01 to 03
Data columns (total 3 columns):
f1 3 non-null float64
f2 3 non-null float64
f3 3 non-null float64
dtypes: float64(3)
memory usage: 96.0+ bytes
In [172]:
df.index
Out[172]:
Index(['01', '02', '03'], dtype='object', name='uid')
Here we read just the header row to get the column names:
cols = pd.read_csv(io.StringIO(t), nrows=1).columns.tolist()
We then generate a dict of the column names with the desired dtypes:
index_col_name = cols[0]
dtypes = dict(zip(cols[1:], [float]* len(cols[1:])))
dtypes[index_col_name] = str
We take the index name, assuming it is the first entry, then build a dict that assigns float to the remaining columns and str to the index column. You can then pass this dict as the dtype param to read_csv.
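The steps above can be wrapped into a small helper. This is only a sketch under the same assumption (the first column is the index); the function name read_with_string_index is made up, and it takes the CSV content as a string so the example stays self-contained. It uses nrows=0 so that only the header line is parsed in the first pass.

```python
import io

import pandas as pd


def read_with_string_index(csv_text):
    """Read CSV text with the first column as a string index and the rest as float."""
    # First pass: parse only the header to learn the column names.
    cols = pd.read_csv(io.StringIO(csv_text), nrows=0).columns.tolist()
    index_col_name = cols[0]

    # float for every data column, str for the index column.
    dtypes = {name: float for name in cols[1:]}
    dtypes[index_col_name] = str

    # Second pass: read the data, then move the string column into the index.
    df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
    return df.set_index(index_col_name)


t = """uid,f1,f2,f3
01,0.1,1,10
02,0.2,2,20
03,0.3,3,30"""

df = read_with_string_index(t)
print(df.index.tolist())
print(df.dtypes)
```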
If the result is not a string, you have to convert it to a string.
Try:
result = [str(i) for i in result]
or, in this case (note that converting integers after loading does not restore leading zeros such as '01'):
print([str(i) for i in df.index.values])
