An extra column appearing in my .csv file [duplicate] - python

I have a situation where sometimes, when I read a CSV into a DataFrame, I get an unwanted index-like column named Unnamed: 0.
file.csv
,A,B,C
0,1,2,3
1,4,5,6
2,7,8,9
The CSV is read with this:
pd.read_csv('file.csv')
Unnamed: 0 A B C
0 0 1 2 3
1 1 4 5 6
2 2 7 8 9
This is very annoying! Does anyone have an idea on how to get rid of this?

It's the index column; pass index=False to df.to_csv(...) to avoid writing out an unnamed index column in the first place. See the to_csv() docs.
Example:
In [37]:
import io
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
pd.read_csv(io.StringIO(df.to_csv()))
Out[37]:
Unnamed: 0 a b c
0 0 0.109066 -1.112704 -0.545209
1 1 0.447114 1.525341 0.317252
2 2 0.507495 0.137863 0.886283
3 3 1.452867 1.888363 1.168101
4 4 0.901371 -0.704805 0.088335
compare with:
In [38]:
pd.read_csv(io.StringIO(df.to_csv(index=False)))
Out[38]:
a b c
0 0.109066 -1.112704 -0.545209
1 0.447114 1.525341 0.317252
2 0.507495 0.137863 0.886283
3 1.452867 1.888363 1.168101
4 0.901371 -0.704805 0.088335
You could also optionally tell read_csv that the first column is the index column by passing index_col=0:
In [40]:
pd.read_csv(io.StringIO(df.to_csv()), index_col=0)
Out[40]:
a b c
0 0.109066 -1.112704 -0.545209
1 0.447114 1.525341 0.317252
2 0.507495 0.137863 0.886283
3 1.452867 1.888363 1.168101
4 0.901371 -0.704805 0.088335

This is usually caused by your CSV having been saved along with an (unnamed) index (RangeIndex).
(The fix would actually need to be done when saving the DataFrame, but this isn't always an option.)
Workaround: read_csv with index_col=[0] argument
IMO, the simplest solution would be to read the unnamed column as the index. Specify an index_col=[0] argument to pd.read_csv; this reads in the first column as the index. (Note the square brackets.)
df = pd.DataFrame('x', index=range(5), columns=list('abc'))
df
a b c
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
# Save DataFrame to CSV.
df.to_csv('file.csv')
pd.read_csv('file.csv')
Unnamed: 0 a b c
0 0 x x x
1 1 x x x
2 2 x x x
3 3 x x x
4 4 x x x
# Now try this again, with the extra argument.
pd.read_csv('file.csv', index_col=[0])
a b c
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
Note
You could have avoided this in the first place by passing index=False when the output CSV was created in pandas, if your DataFrame does not have an index to begin with:
df.to_csv('file.csv', index=False)
But as mentioned above, this isn't always an option.
Stopgap Solution: Filtering with str.match
If you cannot modify the code to read/write the CSV file, you can just remove the column by filtering with str.match:
df
Unnamed: 0 a b c
0 0 x x x
1 1 x x x
2 2 x x x
3 3 x x x
4 4 x x x
df.columns
# Index(['Unnamed: 0', 'a', 'b', 'c'], dtype='object')
df.columns.str.match('Unnamed')
# array([ True, False, False, False])
df.loc[:, ~df.columns.str.match('Unnamed')]
a b c
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x

To get rid of all Unnamed columns, you can also use a regex, such as df.drop(df.filter(regex="Unnamed"), axis=1, inplace=True)
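For instance, a minimal sketch of that regex-based drop (the sample frame here is just illustrative, using the slightly more explicit .columns form):
import pandas as pd
df = pd.DataFrame({'Unnamed: 0': [0, 1], 'a': ['x', 'x'], 'b': ['y', 'y']})
# filter(regex=...) selects every column whose name matches; drop then removes them.
df.drop(columns=df.filter(regex="Unnamed").columns, inplace=True)
print(df.columns.tolist()) # ['a', 'b']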

Another case where this can happen is if your data was improperly written to the CSV so that each row ends with a trailing comma. This leaves you with an unnamed column, Unnamed: N, at the end of the data when you try to read it into a DataFrame.
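A minimal sketch of that trailing-comma case (in-memory data, just for illustration):
import io
import pandas as pd
# The header and every data row end with a stray comma, producing an empty last field.
csv_data = 'a,b,\n1,2,\n3,4,\n'
df = pd.read_csv(io.StringIO(csv_data))
print(df.columns.tolist()) # ['a', 'b', 'Unnamed: 2']
# One way to cope is to read only the named columns:
df = pd.read_csv(io.StringIO(csv_data), usecols=['a', 'b'])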

You can do either of the following with 'Unnamed' Columns:
Delete unnamed columns
Rename them (if you want to use them)
Method 1: Delete Unnamed Columns
# delete one by one; e.g., if the column is 'Unnamed: 0', drop it by its name
df.drop('Unnamed: 0', axis=1, inplace=True)
# delete all Unnamed columns in a single line of code using regex
df.drop(df.filter(regex="Unnamed"),axis=1, inplace=True)
Method 2: Rename Unnamed Columns
df.rename(columns = {'Unnamed: 0':'Name'}, inplace = True)
If you want to write out with a blank header as in the input file, just choose 'Name' above to be '' (see the sketch after the code below).
where the OP's input data 'file.csv' was:
,A,B,C
0,1,2,3
1,4,5,6
2,7,8,9
#read file
df = pd.read_csv('file.csv')
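For example, a minimal sketch of the blank-header round trip mentioned above:
import pandas as pd
df = pd.read_csv('file.csv')
# Rename the unnamed column to an empty string so its header is written back blank.
df.rename(columns={'Unnamed: 0': ''}, inplace=True)
df.to_csv('file.csv', index=False) # first line of the file: ,A,B,C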

Simply delete that column using: del df['column_name']

Simply do this:
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

Alternatively:
df = df.drop(columns=['Unnamed: 0'])

from IPython.display import display
import io
import pandas as pd

# Read the file, treating the first (unnamed) column as the index.
df = pd.read_csv('file.csv', index_col=[0])
# Round-trip through an in-memory CSV written without the index to drop it entirely.
df = pd.read_csv(io.StringIO(df.to_csv(index=False)))
display(df.head(5))

A solution that is agnostic to whether or not the index was written when calling df.to_csv() is shown below:
df = pd.read_csv(file_name)
if 'Unnamed: 0' in df.columns:
df.drop('Unnamed: 0', axis=1, inplace=True)
If an index was not written, then index_col=[0] will use the first data column as the index, which is behavior one would not want.
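A slightly shorter variant of the same guard is a sketch like this (DataFrame.drop accepts errors='ignore', which makes the drop a no-op when the column is absent):
df = pd.read_csv(file_name).drop(columns=['Unnamed: 0'], errors='ignore')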

In my experience, there are many reasons you might not want to set that column as the index with index_col=[0], as so many people suggest above. For example, it might contain jumbled index values, because the data were saved to CSV after being indexed or sorted without df.reset_index(drop=True), leading to instant confusion.
So if you know the file has this column and you don't want it, as per the original question, the simplest 1-line solutions are:
df = pd.read_csv('file.csv').drop(columns=['Unnamed: 0'])
or
df = pd.read_csv('file.csv',index_col=[0]).reset_index(drop=True)

Related

Pivot table to "tidy" data frame in Pandas

I have an array of numbers (I think the format makes it a pivot table) that I want to turn into a "tidy" data frame. For example, I start with variable 1 down the left, variable 2 across the top, and the value of interest in the middle, something like this:
X Y
A 1 2
B 3 4
I want to turn that into a tidy data frame like this:
V1 V2 value
A X 1
A Y 2
B X 3
B Y 4
The row and column order don't matter to me, so the following is totally acceptable:
value V1 V2
2 A Y
4 B Y
3 B X
1 A X
For my first go at this, which was able to get me the correct final answer, I looped over the rows and columns. This was terribly slow, and I suspected that some machinery in Pandas would make it go faster.
It seems that melt is close to the magic I seek, but it doesn't get me all the way there. That first array turns into this:
V2 value
0 X 1
1 X 2
2 Y 3
3 Y 4
It gets rid of my V1 variable!
Nothing is special about melt, so I will be happy to read answers that use other approaches, particularly if melt is not much faster than my nested loops and another solution is. Nonetheless, how can I go from that array to the kind of tidy data frame I want as the output?
Example dataframe:
df = pd.DataFrame({"X":[1,3], "Y":[2,4]},index=["A","B"])
Use DataFrame.rename_axis with DataFrame.reset_index, and then DataFrame.melt. If you want to order the columns, use DataFrame.reindex.
new_df = (df.rename_axis(index = 'V1')
.reset_index()
.melt('V1',var_name='V2')
.reindex(columns = ['value','V1','V2']))
print(new_df)
Another approach DataFrame.stack:
new_df = (df.stack()
.rename_axis(index = ['V1','V2'])
.rename('value')
.reset_index()
.reindex(columns = ['value','V1','V2']))
print(new_df)
value V1 V2
0 1 A X
1 3 B X
2 2 A Y
3 4 B Y
To set the names, there is another alternative, as @Scott Boston comments.
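One such alternative could be a sketch like the following (assuming pandas >= 1.1, where melt gained the ignore_index argument):
new_df = (df.melt(ignore_index=False, var_name='V2', value_name='value')
            .rename_axis('V1')
            .reset_index()
            .reindex(columns=['value','V1','V2']))
print(new_df)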
Melt is a good approach, but it doesn't seem to play nicely with identifying the results by index. You can reset the index first to move it to its own column, then use that column as the id col.
test = pd.DataFrame([[1,2],[3,4]], columns=['X', 'Y'], index=['A', 'B'])
X Y
A 1 2
B 3 4
test = test.reset_index()
index X Y
0 A 1 2
1 B 3 4
test.melt('index',['X', 'Y'], 'prev cols')
index prev cols value
0 A X 1
1 B X 3
2 A Y 2
3 B Y 4

Ascending order of Excel rows

I have the following rows in Excel:
How can I put them in ascending order in Python (i.e. notice how the row starting with 12 comes before the one starting with 118)?
I think the Pandas library would be a starting point? Any clue is appreciated.
Thanks.
First, read the Excel file:
df = pd.read_excel("your/file/path/file.xls")
df
data
0 1212.i.jpg
1 121.i.jpg
2 212.i.jpg
3 512.i.jpg
Then make a substring of the data (assuming the column name is "data"):
df["sub"] = df["data"].str[:-6]
Just in case, convert the new column to type int
df["sub"] = df["sub"].astype(int)
Now sort the values by that new column:
df.sort_values("sub", inplace=True)
Finally, if you only want your original data:
df = df["data"]
1 121.i.jpg
2 212.i.jpg
3 512.i.jpg
0 1212.i.jpg
Using natsorted
from natsort import natsorted
df.data=natsorted(df.data)
df
Out[129]:
data
0 121.i.jpg
1 212.i.jpg
2 512.i.jpg
3 1212.i.jpg
Keep original data index
df.loc[natsorted(df.index,key=lambda x : df.data[x] )]
Out[138]:
data
1 121.i.jpg
2 212.i.jpg
3 512.i.jpg
0 1212.i.jpg
Or using argsort with split
df.iloc[np.argsort(df.data.str.split('.').str[0].astype(int))]
Out[141]:
data
1 121.i.jpg
2 212.i.jpg
3 512.i.jpg
0 1212.i.jpg
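A pure-pandas alternative is a sketch like this (assuming pandas >= 1.1, where sort_values gained the key argument); like the versions above, it keeps the original index:
df.sort_values('data', key=lambda s: s.str.split('.').str[0].astype(int))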

How to delete all columns in DataFrame except certain ones?

Let's say I have a DataFrame that looks like this:
a b c d e f g
1 2 3 4 5 6 7
4 3 7 1 6 9 4
8 9 0 2 4 2 1
How would I go about deleting every column besides a and b?
This would result in:
a b
1 2
4 3
8 9
I would like a way to delete these using a simple line of code that says, delete all columns besides a and b, because let's say hypothetically I have 1000 columns of data.
Thank you.
In [48]: df.drop(df.columns.difference(['a','b']), axis=1, inplace=True)

In [49]: df
Out[49]:
a b
0 1 2
1 4 3
2 8 9
or:
In [55]: df = df.loc[:, df.columns.intersection(['a','b'])]
In [56]: df
Out[56]:
a b
0 1 2
1 4 3
2 8 9
P.S. Please be aware that the most idiomatic pandas way to do that was already proposed by @Wen:
df = df[['a','b']]
or
df = df.loc[:, ['a','b']]
Another option to add to the mix. I prefer this approach for readability.
df = df.filter(['a', 'b'])
Where the first positional argument is items=[]
Bonus
You can also use a like argument or regex to filter.
Helpful if you have a set of columns like ['a_1','a_2','b_1','b_2']
You can do
df = df.filter(like='b_')
and end up with ['b_1','b_2']
Pandas documentation for filter.
There are multiple solutions:
df = df[['a','b']] # 1
df = df[list('ab')] # 2
df = df.loc[:, df.columns.isin(['a','b'])] # 3
df = pd.DataFrame(data=df.eval('a,b').T, columns=['a','b']) # 4 (P.S. I do not recommend this method, but it is still a way to achieve this)
Hey, what you are looking for is:
df = df[["a","b"]]
You will receive a dataframe which only contains the columns a and b.
If you want to keep more columns than you're dropping, put a "~" before the .isin statement to select every column except the ones you list:
df = df.loc[:, ~df.columns.isin(['a','b'])]
If you have more than two columns that you want to drop, let's say 20 or 30, it can be easier to list the columns you want to keep instead and drop the rest. Make sure that you also specify the axis value.
keep_list = ["a","b"]
df = df.drop(df.columns.difference(keep_list), axis=1)

how to write the pivot_table to txt file by python

I have got the pivot_table as follows, but there are spaces in the table when I write it out. How can I write it cleanly to a txt file? Here is my attempt:
chaoshidishi=pd.pivot_table(clsc,index="故障发生地市",values="工单号",aggfunc=len)
chaoshidishi=chaoshidishi.to_frame()
f=open('E:\gaotie\dishi.txt','w')
for row in chaoshidishi:
f.write(row[0]+row[1])
f.close()
Following up on @shanmuga's comment, you should be able to use to_csv() without first using to_frame().
First, here's some sample data that seems to reflect your setup:
import pandas as pd
group = ['a','a','b','c','c']
value = [1,2,3,4,5]
df = pd.DataFrame({'group':group,'value':value})
print(df)
group value
0 a 1
1 a 2
2 b 3
3 c 4
4 c 5
Now apply pivot_table():
df.pivot_table(columns='group', values='value', aggfunc=len)
group
a 2
b 1
c 2
Name: value, dtype: int64
You can save to file directly from this output. If you don't want to preserve index and column names, use header=None on load:
(df.pivot_table(columns='group', values='value', aggfunc=len)
.to_csv('foo.txt'))
newdf = pd.read_csv('foo.txt', header=None)
print(newdf)
0 1
0 a 2
1 b 1
2 c 2
To preserve column and index names, use the header argument on save, and the index_col argument on load:
(df.pivot_table(columns='group', values='value', aggfunc=len)
.to_csv('foo.txt', header='group'))
newdf = pd.read_csv('foo.txt', index_col='group')
print(newdf)
value
group
a 2
b 1
c 2
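Since the target is a .txt file, you may also prefer tabs over commas; to_csv takes a sep argument (a small sketch, reusing the pivot above):
(df.pivot_table(columns='group', values='value', aggfunc=len)
 .to_csv('foo.txt', sep='\t'))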

Python, pandas: how to remove greater than sign

Let's say I have the following example DataFrame
from pandas import Series, DataFrame
df = DataFrame({'A':['1', '<2', '3']})
I would like to convert column A from string to integer. In the case of '<2', I'd like to simply take off the '<' sign and put 1 (the closest integer less than 2) in the second row. What's the most efficient way to do that? This is just an example; the actual data that I'm working on has hundreds of thousands of rows.
Thanks for your help in advance.
You could use Series.apply:
import pandas as pd
df = pd.DataFrame({'A':['1', '<2', '3']})
df['A'] = df['A'].apply(lambda x: int(x[1:])-1 if x.startswith('<') else int(x))
print(df.dtypes)
# A int64
# dtype: object
yields
print(df)
A
0 1
1 1
2 3
[3 rows x 1 columns]
You can use applymap on the DataFrame and remove the "<" character if it appears in the string:
df.applymap(lambda x: x.replace('<',''))
Here is the output:
A
0 1
1 2
2 3
Here are two other ways of doing this which may be helpful going forward!
from pandas import Series, DataFrame
df = DataFrame({'A':['1', '<2', '3']})
Outputs
df.A.str.strip('<').astype(int)
Out[1]:
0 1
1 2
2 3
And this way would be helpful if you were trying to remove a character in the middle of your number (e.g. if you had a comma or something).
df = DataFrame({'A':['1', '1,002', '3']})
df.A.str.replace(',', '').astype(int)
Outputs
Out[11]:
0 1
1 1002
2 3
Name: A, dtype: int64
>>> import re
>>> df.applymap(lambda x: int(re.sub(r'[^0-9.]', '', x)))
A
0 1
1 2
2 3
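If you also need the question's "closest integer less than" behavior in vectorized form, here is a sketch (the column name follows the example above):
import pandas as pd
df = pd.DataFrame({'A': ['1', '<2', '3']})
# Flag the '<' entries, strip the sign, convert, then subtract 1 where flagged.
is_less = df['A'].str.startswith('<')
vals = pd.to_numeric(df['A'].str.lstrip('<'))
df['A'] = vals.where(~is_less, vals - 1)
print(df) # A: 1, 1, 3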
