Saving and Loading of dataframe to csv results in Unnamed columns - python

The problem is in the title. Example:
x=[('a','a','c') for i in range(5)]
df = DataFrame(x,columns=['col1','col2','col3'])
df.to_csv('test.csv')
df1 = read_csv('test.csv')
Unnamed: 0 col1 col2 col3
0 0 a a c
1 1 a a c
2 2 a a c
3 3 a a c
4 4 a a c
The reason seems to be that when a dataframe is saved, the index column is written too, with no name in the header. When the CSV is loaded again, the index comes back as an unnamed column. Is this a bug? How can I avoid writing the index to the CSV, or drop unnamed columns when reading?

You can omit the row labels with the index parameter of to_csv (index=False), or control their header with index_label.

Writing and reading are not symmetric: a CSV file has no way to mark a column as the index, so its position is ambiguous. You need to specify index_col when reading the file back:
In [1]: x=[('a','a','c') for i in range(5)]
In [2]: df = DataFrame(x,columns=['col1','col2','col3'])
In [3]: df.to_csv('test.csv')
In [4]: !cat test.csv
,col1,col2,col3
0,a,a,c
1,a,a,c
2,a,a,c
3,a,a,c
4,a,a,c
In [5]: pd.read_csv('test.csv',index_col=0)
Out[5]:
col1 col2 col3
0 a a c
1 a a c
2 a a c
3 a a c
4 a a c
If the index has a name, the file looks very similar to the above, so a reader cannot tell whether 'foo' is a column or an index:
In [6]: df.index.name = 'foo'
In [7]: df.to_csv('test.csv')
In [8]: !cat test.csv
foo,col1,col2,col3
0,a,a,c
1,a,a,c
2,a,a,c
3,a,a,c
4,a,a,c
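Both ways of avoiding the extra column can be checked in memory. This is a minimal sketch, assuming a recent pandas; io.StringIO stands in for test.csv and the variable names are only illustrative:

```python
import io

import pandas as pd

df = pd.DataFrame([("a", "a", "c") for _ in range(5)],
                  columns=["col1", "col2", "col3"])

# Option 1: do not write the row labels at all.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
no_index = pd.read_csv(buf)

# Option 2: write the row labels, then tell read_csv that the first
# column of the file is the index.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf, index_col=0)

print(list(no_index.columns))    # ['col1', 'col2', 'col3']
print(list(with_index.columns))  # ['col1', 'col2', 'col3']
```

Option 1 is the simpler choice when the index is just the default 0..n-1 counter; option 2 is worth keeping when the row labels carry information.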

You can specify explicitly which columns you want to write using the columns parameter (old pandas releases called it cols).

This is how to use index_label:
df.to_csv('test.csv', index_label=False)
But when I tried to submit the file to Kaggle, it returned the error "ERROR: Record 1 had 3 columns but expected 2", so I solved it with the following code.
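The code that followed is missing from this copy of the answer. A plausible reading, offered as an assumption rather than as the author's actual fix, is index=False: index_label=False only removes the index field from the header line, so every data row ends up one field longer than the header, which is the kind of mismatch the Kaggle error describes. A small sketch with made-up submission data:

```python
import io

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "target": [0.1, 0.9]})  # made-up data


def csv_lines(**kwargs):
    """Return the lines that df.to_csv writes for the given keyword arguments."""
    buf = io.StringIO()
    df.to_csv(buf, **kwargs)
    return buf.getvalue().splitlines()


label_off = csv_lines(index_label=False)  # header lacks the index field, rows keep it
index_off = csv_lines(index=False)        # no index anywhere

# Print the number of comma-separated fields on each line.
for name, lines in [("index_label=False", label_off), ("index=False", index_off)]:
    print(name, [len(line.split(",")) for line in lines])
```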

Related

Drop / move header of a dataframe into first row

I have a dataframe that looks like this:
A B 10
0 A B 20
1 C A 10
The headers are not the real headers of the dataframe (I have to map them from another dataframe). How can I move the headers into the first row, so that it looks like this:
0 1 2
0 A B 10
1 A B 20
2 C A 10
Note that pd.read_csv(..., header=None) leads to an error in this case, I don't know why, so I am looking for a way to fix it after loading the file.
The best approach is to avoid the problem with the header=None parameter of read_csv:
df = pd.read_csv(file, header=None)
If that is not possible, convert the column names to a one-row DataFrame, put it in front of the original data, and then set the column names to a range:
# DataFrame.append was removed in pandas 2.0, so use pd.concat
df = pd.concat([df.columns.to_frame().T, df], ignore_index=True)
df.columns = range(len(df.columns))
print (df)
0 1 2
0 A B 10
1 A B 20
2 C A 10
Another option is to use reset_index on the transpose:
df = df.T.reset_index().T
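On its own, the transpose trick leaves the promoted row labelled 'index' instead of 0. A minimal sketch, assuming the header values were read as strings, that also renumbers the rows:

```python
import pandas as pd

# The real first record ('A', 'B', '10') was wrongly used as the header.
df = pd.DataFrame([["A", "B", 20], ["C", "A", 10]], columns=["A", "B", "10"])

# Transpose, turn the old header into a regular column, transpose back,
# then replace the row labels ('index', 0, 1) with 0, 1, 2.
fixed = df.T.reset_index().T.reset_index(drop=True)
print(fixed)
```

The promoted value '10' stays a string, so the last column holds mixed types and may need converting afterwards.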

How to write a pivot_table to a txt file in Python

I have obtained the pivot_table as follows:
There are spaces in the table.
What I want to write to the txt file is:
How can I get it?
chaoshidishi=pd.pivot_table(clsc,index="故障发生地市",values="工单号",aggfunc=len)
chaoshidishi=chaoshidishi.to_frame()
f=open('E:\gaotie\dishi.txt','w')
for row in chaoshidishi:
    f.write(row[0]+row[1])
f.close()
Following up on @shanmuga's comment, you should be able to use to_csv() without first using to_frame().
First, here's some sample data that seems to reflect your setup:
import pandas as pd
group = ['a','a','b','c','c']
value = [1,2,3,4,5]
df = pd.DataFrame({'group':group,'value':value})
print(df)
group value
0 a 1
1 a 2
2 b 3
3 c 4
4 c 5
Now apply pivot_table():
df.pivot_table(columns='group', values='value', aggfunc=len)
group
a 2
b 1
c 2
Name: value, dtype: int64
You can save to file directly from this output. If you don't want to preserve index and column names, use header=None on load:
(df.pivot_table(columns='group', values='value', aggfunc=len)
.to_csv('foo.txt'))
newdf = pd.read_csv('foo.txt', header=None)
print(newdf)
0 1
0 a 2
1 b 1
2 c 2
To preserve column and index names, use the header argument on save, and the index_col argument on load:
(df.pivot_table(columns='group', values='value', aggfunc=len)
.to_csv('foo.txt', header='group'))
newdf = pd.read_csv('foo.txt', index_col='group')
print(newdf)
value
group
a 2
b 1
c 2
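The desired file layout is not shown in the question, so the following is only a sketch of one common choice: one tab-separated "label, count" line per group and no header row. It groups with index='group' so that the result has one row per group, and it writes to io.StringIO as a stand-in for a real file such as foo.txt:

```python
import io

import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b", "c", "c"],
                   "value": [1, 2, 3, 4, 5]})

# Count the rows per group; the group labels become the index.
counts = df.pivot_table(index="group", values="value", aggfunc="count")

# Write one "label<TAB>count" line per group, with no header row.
buf = io.StringIO()  # stand-in for open('foo.txt', 'w')
counts.to_csv(buf, sep="\t", header=False)
print(buf.getvalue())
```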

Prevent pandas read_csv treating first row as header of column names

I'm reading in a pandas DataFrame using pd.read_csv. I want to keep the first row as data, however it keeps getting converted to column names.
I tried header=False but this just deleted it entirely.
(Note on my input data: I have a string (st = '\n'.join(lst)) that I convert to a file-like object (io.StringIO(st)), then build the csv from that file object.)
You want header=None. False gets type-promoted to the int 0; see the docs (emphasis mine):
header : int or list of ints, default ‘infer’ Row number(s) to use as
the column names, and the start of the data. Default behavior is as if
set to 0 if no names passed, otherwise None. Explicitly pass header=0
to be able to replace existing names. The header can be a list of
integers that specify row locations for a multi-index on the columns
e.g. [0,1,3]. Intervening rows that are not specified will be skipped
(e.g. 2 in this example is skipped). Note that this parameter ignores
commented lines and empty lines if skip_blank_lines=True, so header=0
denotes the first line of data rather than the first line of the file.
You can see the difference in behaviour, first with header=0:
In [95]:
import io
import pandas as pd
t="""a,b,c
0,1,2
3,4,5"""
pd.read_csv(io.StringIO(t), header=0)
Out[95]:
a b c
0 0 1 2
1 3 4 5
Now with None:
In [96]:
pd.read_csv(io.StringIO(t), header=None)
Out[96]:
0 1 2
0 a b c
1 0 1 2
2 3 4 5
Note that as of version 0.19.1 (the latest when this was written), this raises a TypeError:
In [98]:
pd.read_csv(io.StringIO(t), header=False)
TypeError: Passing a bool to header is invalid. Use header=None for no
header or header=int or list-like of ints to specify the row(s) making
up the column names
I think you need parameter header=None to read_csv:
Sample:
import pandas as pd
from io import StringIO  # pandas.compat.StringIO is gone from recent pandas
temp=u"""a,b
2,1
1,1"""
df = pd.read_csv(StringIO(temp),header=None)
print (df)
0 1
0 a b
1 2 1
2 1 1
If you're using pd.ExcelFile to read all the sheets of an Excel file, then:
xls = pd.ExcelFile("path_to_file.xlsx")
xls.sheet_names # Lists the sheet names in the Excel file
df = xls.parse(2, header=None) # Parse the sheet at position 2 (zero-based, so the third sheet) with header=None
df
Output:
0 1
0 a b
1 1 1
2 0 1
3 5 2
You can set custom column names to prevent this.
Say you have two columns in your dataset:
df = pd.read_csv(your_file_path, names = ['first column', 'second column'])
If you have more columns than that, you can also generate the column names programmatically and pass the resulting list to the names argument.
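Generating the names could look like the sketch below. The col0, col1, ... scheme is made up for illustration, and counting fields with a plain split on commas is an assumption that only holds when the first line has no quoted commas:

```python
import io

import pandas as pd

raw = "a,b,c\n0,1,2\n3,4,5"

# Hypothetical naming scheme: col0, col1, ... for however many fields
# the first line contains.
n_fields = len(raw.splitlines()[0].split(","))
names = [f"col{i}" for i in range(n_fields)]

# With names given and no header argument, the first line is kept as data.
df = pd.read_csv(io.StringIO(raw), names=names)
print(df)
```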

Naming pandas columns from arbitrary row data

I have a python script where I have read in a csv file using pandas:
colnames = ['col1','col2','col3','col4','col5','col6','col7','col8','col9','col10']
csv_input = pd.read_csv(ifile, names=colnames)
The CSV file is filled with lots of unneeded junk, but the column names I want to use are defined by a row with DataName in col1.
csv_names = csv_input[csv_input.col1 == 'DataName']
The actual data is in rows with DataValue in col1, and I don't need the rest.
csv_input = csv_input[csv_input.col1 == 'DataValue']
What I'd like to do is rename the columns in csv_input with the values of csv_names, but I can't find the right syntax to do this. I have tried
csv_input.columns = csv_names.values
Which gives the error
ValueError: Length mismatch: Expected axis has 10 elements, new values have 1 elements
Any suggestions greatly appreciated.
You should be able to just directly assign them like so:
In [28]:
df = pd.DataFrame({'a':[0,'e',1], 'b':[0,'f',2],'c':[0,'g',2]})
df
Out[28]:
a b c
0 0 0 0
1 e f g
2 1 2 2
In [29]:
df.columns = df.loc[1]
df
Out[29]:
1 e f g
0 0 0 0
1 e f g
2 1 2 2
so in your case, csv_names is a one-row DataFrame; take that single row and assign it:
csv_input.columns = csv_names.iloc[0]
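The length mismatch in the question comes from csv_names.values being two-dimensional, with shape (1, 10), so it counts as one element. A sketch on a made-up three-column file that picks out the single matching row before assigning:

```python
import io

import pandas as pd

# Made-up file: a junk line, one DataName row holding the real column
# names, and DataValue rows holding the data.
raw = """junk,x,y
DataName,height,width
DataValue,1,2
DataValue,3,4
"""
colnames = ["col1", "col2", "col3"]
csv_input = pd.read_csv(io.StringIO(raw), names=colnames)

csv_names = csv_input[csv_input.col1 == "DataName"]
csv_input = csv_input[csv_input.col1 == "DataValue"].copy()

# csv_names has shape (1, 3); its only row supplies the 3 names.
csv_input.columns = csv_names.iloc[0].tolist()
print(csv_input)
```

Because names and values shared columns in the file, the data values are still strings here and would need a conversion such as pd.to_numeric before any arithmetic.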

How to remove duplicate columns from a dataframe using python pandas

By grouping two columns I made some changes.
I generated a file using Python, and it ended up with 2 duplicate columns. How do I remove duplicate columns from a dataframe?
It's probably easiest to use a groupby (assuming they have duplicate names too):
In [11]: df
Out[11]:
A B B
0 a 4 4
1 b 4 4
2 c 4 4
In [12]: df.T.groupby(level=0).first().T
Out[12]:
A B
0 a 4
1 b 4
2 c 4
If they have different names you can drop_duplicates on the transpose:
In [21]: df
Out[21]:
A B C
0 a 4 4
1 b 4 4
2 c 4 4
In [22]: df.T.drop_duplicates().T
Out[22]:
A B
0 a 4
1 b 4
2 c 4
Usually read_csv will ensure they have different names...
Transposing is a bad idea when working with large DataFrames. See this answer for a memory efficient alternative: https://stackoverflow.com/a/32961145/759442
This is the best I found so far.
import numpy as np

remove = []
cols = df.columns
# Compare every column with each column to its right
for i in range(len(cols)-1):
    v = df[cols[i]].values
    for j in range(i+1,len(cols)):
        if np.array_equal(v,df[cols[j]].values):
            remove.append(cols[j])
df.drop(remove, axis=1, inplace=True)
https://www.kaggle.com/kobakhit/santander-customer-satisfaction/0-84-score-with-36-features-only/code
This has already been answered in "python pandas remove duplicate columns".
The idea is that df.columns.duplicated() generates a boolean vector in which each value says whether that column name has been seen before. For example, if df has columns ["Col1", "Col2", "Col1"], it generates [False, False, True]. Invert it and call the result column_selector.
With this vector and the loc method of df, which selects rows and columns, we can remove the duplicate columns: df.loc[:, column_selector] keeps only the selected columns.
column_selector = ~df.columns.duplicated()
df = df.loc[:, column_selector]
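A self-contained run of this idea on a toy frame (the data values are invented). Note that columns is an attribute rather than a method, and that only the names are compared, with the first occurrence kept:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["Col1", "Col2", "Col1"])

# True for every column whose name has not appeared earlier.
column_selector = ~df.columns.duplicated()
deduped = df.loc[:, column_selector]
print(deduped)
```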
I understand that this is an old question, but I recently had this same issue and none of these solutions worked for me, or the looping suggestion seemed a bit overkill. In the end, I simply found the index of the undesirable duplicate column and dropped that column index. So provided you know the index of the column this will work (which you could probably find via debugging or print statements):
df.drop(df.columns[i], axis=1)
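One caveat: df.columns[i] is a label, so when the duplicates share a name, dropping by that label removes all of them. A sketch that drops strictly by position instead, assuming the position i is already known:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([["a", 4, 4], ["b", 4, 4]], columns=["A", "B", "B"])
i = 2  # position of the unwanted column (assumed known)

# Dropping by label removes every column named 'B'.
by_label = df.drop(df.columns[i], axis=1)

# A boolean mask over positions removes only the column at position i.
keep = np.arange(df.shape[1]) != i
by_position = df.iloc[:, keep]

print(list(by_label.columns))     # ['A']
print(list(by_position.columns))  # ['A', 'B']
```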
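A fast solution for datasets without NaNs is to look for duplicate columns in a small random sample of the rows only. Columns that happen to match on the sample are treated as duplicates, so the result is approximate: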
share = 0.05
dfx = df.sample(int(df.shape[0]*share))
dfx = dfx.T.drop_duplicates().T
df = df[dfx.columns]
