Remove Unnamed columns in pandas dataframe [duplicate] - python

This question already has answers here:
How to get rid of "Unnamed: 0" column in a pandas DataFrame read in from CSV file?
(11 answers)
Closed 4 years ago.
I have a data file with columns A-G like below, but when I read it with pd.read_csv('data.csv') it prints an extra unnamed column at the end for no reason.
colA ColB colC colD colE colF colG Unnamed: 7
44 45 26 26 40 26 46 NaN
47 16 38 47 48 22 37 NaN
19 28 36 18 40 18 46 NaN
50 14 12 33 12 44 23 NaN
39 47 16 42 33 48 38 NaN
I have checked my data file several times, but there is no extra data in any other column. How should I remove this extra column while reading? Thanks

df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
In [162]: df
Out[162]:
colA ColB colC colD colE colF colG
0 44 45 26 26 40 26 46
1 47 16 38 47 48 22 37
2 19 28 36 18 40 18 46
3 50 14 12 33 12 44 23
4 39 47 16 42 33 48 38
NOTE: very often there is only one unnamed column Unnamed: 0, which is the first column in the CSV file. This is the result of the following steps:
a DataFrame is saved into a CSV file using parameter index=True, which is the default behaviour
we read this CSV file into a DataFrame using pd.read_csv() without explicitly specifying index_col=0 (default: index_col=None)
The easiest way to get rid of this column is to specify the parameter pd.read_csv(..., index_col=0):
df = pd.read_csv('data.csv', index_col=0)
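The round trip described in this note can be reproduced in a quick sketch (using an in-memory buffer instead of a file):

```python
import io

import pandas as pd

# Save a DataFrame with the default index=True, then read it back naively.
df = pd.DataFrame({'colA': [1, 2], 'colB': [3, 4]})
buf = io.StringIO()
df.to_csv(buf)                          # the index is written as an unnamed first column
buf.seek(0)
round_trip = pd.read_csv(buf)           # picks up an extra 'Unnamed: 0' column
buf.seek(0)
fixed = pd.read_csv(buf, index_col=0)   # index restored, no extra column
```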

First, find the columns whose names contain 'unnamed', then drop those columns. Note: you should add inplace=True to the .drop parameters as well.
df.drop(df.columns[df.columns.str.contains('unnamed', case=False)], axis=1, inplace=True)

The pandas.DataFrame.dropna function removes missing values (e.g. NaN, NaT).
For example, the following code removes any columns from your dataframe where all of the elements of that column are missing.
df.dropna(how='all', axis='columns')
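A quick sketch of this approach; note that dropna returns a new DataFrame, so assign the result back (and be aware this drops any all-NaN column, not just the unnamed one):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': [44, 47], 'colB': [45, 16],
                   'Unnamed: 7': [np.nan, np.nan]})
# Keep only columns that have at least one non-missing value.
df = df.dropna(how='all', axis='columns')
```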

The accepted solution doesn't work in my case, so my solution is the following one:
# The column name in the example case is "Unnamed: 7",
# but this works with any other name ("Unnamed: 0", for example).
df.rename({"Unnamed: 7":"a"}, axis="columns", inplace=True)
# Then, drop the column as usual.
df.drop(["a"], axis=1, inplace=True)
Hope it helps others.

Related

How to display truncated data points in a dataframe and csv files [duplicate]

I work with Series and DataFrames on the terminal a lot. The default __repr__ for a Series returns a reduced sample, with some head and tail values, but the rest missing.
Is there a builtin way to pretty-print the entire Series / DataFrame? Ideally, it would support proper alignment, perhaps borders between columns, and maybe even color-coding for the different columns.
You can also use the option_context, with one or more options:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    # more options can be specified as well
    print(df)
This will automatically return the options to their previous values.
If you are working in a Jupyter notebook, using display(df) instead of print(df) will use Jupyter's rich display logic.
No need to hack settings. There is a simple way:
print(df.to_string())
Sure, if this comes up a lot, make a function like this one. You can even configure it to load every time you start IPython: https://ipython.org/ipython-doc/1/config/overview.html
def print_full(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')
As for coloring, getting too elaborate with colors sounds counterproductive to me, but I agree something like bootstrap's .table-striped would be nice. You could always create an issue to suggest this feature.
After importing pandas, as an alternative to using the context manager, set such options for displaying entire dataframes:
pd.set_option('display.max_columns', None) # or 1000
pd.set_option('display.max_rows', None) # or 1000
pd.set_option('display.max_colwidth', None) # or 199
For a full list of useful options, see:
pd.describe_option('display')
Use the tabulate package:
pip install tabulate
And consider the following example usage:
import pandas as pd
from io import StringIO
from tabulate import tabulate
c = """Chromosome Start End
chr1 3 6
chr1 5 7
chr1 8 9"""
df = pd.read_table(StringIO(c), sep=r"\s+", header=0)
print(tabulate(df, headers='keys', tablefmt='psql'))
+----+--------------+---------+-------+
| | Chromosome | Start | End |
|----+--------------+---------+-------|
| 0 | chr1 | 3 | 6 |
| 1 | chr1 | 5 | 7 |
| 2 | chr1 | 8 | 9 |
+----+--------------+---------+-------+
Using pd.options.display
This answer is a variation of the prior answer by lucidyan. It makes the code more readable by avoiding the use of set_option.
After importing pandas, as an alternative to using the context manager, set such options for displaying large dataframes:
def set_pandas_display_options() -> None:
    """Set pandas display options."""
    # Ref: https://stackoverflow.com/a/52432757/
    display = pd.options.display
    display.max_columns = 1000
    display.max_rows = 1000
    display.max_colwidth = 199
    display.width = 1000
    # display.precision = 2  # set as needed

set_pandas_display_options()
After this, you can use either display(df) or just df if using a notebook, otherwise print(df).
Using to_string
Pandas 0.25.3 and later have DataFrame.to_string and Series.to_string methods which accept formatting options.
Using to_markdown
If what you need is markdown output, Pandas 1.0.0 has DataFrame.to_markdown and Series.to_markdown methods.
Using to_html
If what you need is HTML output, Pandas 0.25.3 has a DataFrame.to_html method but not a Series.to_html. Note that a Series can be converted to a DataFrame.
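A quick sketch of these exporters (to_markdown additionally requires the optional tabulate dependency, so it is left commented out here):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
full_text = df.to_string()  # plain-text table, never truncated
html = df.to_html()         # an HTML <table> for notebooks and reports
# md = df.to_markdown()     # needs the optional `tabulate` package
```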
If you are using IPython Notebook (Jupyter), you can use HTML:
from IPython.core.display import HTML
display(HTML(df.to_html()))
Try this:
pd.set_option('display.height', 1000)  # removed in newer pandas versions; drop this line if it raises
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
Note that the display.height option was deprecated and later removed, so skip that line on recent pandas versions.
datascroller was created in part to solve this problem.
pip install datascroller
It loads the dataframe into a terminal view you can "scroll" with your mouse or arrow keys, kind of like an Excel workbook at the terminal that supports querying, highlighting, etc.
import pandas as pd
from datascroller import scroll
# Call `scroll` with a Pandas DataFrame as the sole argument:
my_df = pd.read_csv('<path to your csv>')
scroll(my_df)
Disclosure: I am one of the authors of datascroller
Scripts
Nobody has proposed this simple plain-text solution:
from pprint import pprint
pprint(s.to_dict())
which produces results like the following:
{'% Diabetes': 0.06365372374283895,
'% Obesity': 0.06365372374283895,
'% Bachelors': 0.0,
'% Poverty': 0.09548058561425843,
'% Driving Deaths': 1.1775938892425206,
'% Excessive Drinking': 0.06365372374283895}
Jupyter Notebooks
Additionally, when using Jupyter notebooks, the following is a great solution.
Note: pd.Series() has no .to_html() so it must be converted to pd.DataFrame()
from IPython.display import display, HTML
display(HTML(s.to_frame().to_html()))
which renders the Series as a formatted HTML table.
You can set expand_frame_repr to False:
display.expand_frame_repr : boolean
Whether to print out the full DataFrame repr for wide DataFrames
across multiple lines, max_columns is still respected, but the output
will wrap-around across multiple “pages” if its width exceeds
display.width.
[default: True]
pd.set_option('display.expand_frame_repr', False)
For more details read How to Pretty-Print Pandas DataFrames and Series
Hi my friend, just run this:
pd.set_option("display.max_rows", None, "display.max_columns", None)
print(df)
Output (all 70 rows are printed in full; abbreviated here):
Column
0 row 0
1 row 1
2 row 2
...
68 row 68
69 row 69
You can achieve this using the method below. Just pass the total number of columns present in the DataFrame as the argument to 'display.max_columns'.
For example:
df = pd.DataFrame(...)
with pd.option_context('display.max_rows', None, 'display.max_columns', df.shape[1]):
    print(df)
Try using the display() function. This automatically uses horizontal and vertical scroll bars, and with this you can display different datasets easily instead of using print().
display(dataframe)
display() supports proper alignment as well.
However, if you want to make the dataset more beautiful, you can check pd.option_context(). It has a lot of options to clearly show the dataframe.
Note - I am using Jupyter Notebooks.

How to obtain the first 4 rows for every 20 rows from a CSV file

I've read the CSV file using pandas and have managed to print the 1st, 2nd, 3rd, and 4th row of every 20 rows using .iloc.
Prem_results = pd.read_csv("../data sets analysis/prem/result.csv")
Prem_results.iloc[:320:20,:]
Prem_results.iloc[1:320:20,:]
Prem_results.iloc[2:320:20,:]
Prem_results.iloc[3:320:20,:]
Is there a way, using iloc, to print the first 4 rows of every 20 lines together rather than separately as I do now? Apologies if this is worded badly; I'm fairly new to both Python and pandas.
Using groupby.head (this assumes import numpy as np):
Prem_results.groupby(np.arange(len(Prem_results)) // 20).head(4)
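A self-contained check of this pattern on a small frame (the column name here is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.arange(60)})
# Label each block of 20 consecutive rows with the same group key,
# then keep the first 4 rows of every block.
first4 = df.groupby(np.arange(len(df)) // 20).head(4)
```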
You can concat slices together like this:
pd.concat([df[i::20] for i in range(4)]).sort_index()
MCVE:
df = pd.DataFrame({'col1':np.arange(1000)})
pd.concat([df[i::20] for i in range(4)]).sort_index().head(20)
Output:
col1
0 0
1 1
2 2
3 3
20 20
21 21
22 22
23 23
40 40
41 41
42 42
43 43
60 60
61 61
62 62
63 63
80 80
81 81
82 82
83 83
df[0::20] starts at row 0 and takes every 20th row; the other three slices do the same starting at rows 1, 2, and 3.
You can also do this while reading the csv itself.
df = pd.DataFrame()
for chunk in pd.read_csv(file_name, chunksize=20):
    df = pd.concat((df, chunk.head(4)))
More resources:
You can read more about the usage of chunksize in the official pandas IO tools documentation.

Plotting different predictions with same column names and categories Python/Seaborn

I have df with different groups. I have two predictions (iqr, median).
cntx_iqr pred_iqr cntx_median pred_median
18-54 83 K18-54 72
R18-54 34 R18-54 48
25-54 33 18-34 47
K18-54 29 18-54 47
18-34 27 R25-54 29
K18-34 25 25-54 23
K25-54 24 K25-54 14
R18-34 22 R18-34 8
R25-54 17 K18-34 6
Now I want to plot them using seaborn, and I have melted the data for plotting. However, it does not look right to me.
pd.melt(df, id_vars=['cntx_iqr', 'cntx_median'], value_name='category', var_name="kind")
I am aiming to compare predictions (pred_iqr,pred_median) from those 2 groups (cntx_iqr, cntx_median) maybe stack barplot or some other useful plot to see how each group differs for those 2 predictions.
Any help/suggestions would be appreciated. Thanks in advance.
Not sure how you obtained the data frame, but you need to match the values first:
df = df[['cntx_iqr', 'pred_iqr']].merge(df[['cntx_median', 'pred_median']],
                                        left_on='cntx_iqr', right_on='cntx_median')
df.head()
cntx_iqr pred_iqr cntx_median pred_median
0 18-54 83 18-54 47
1 R18-54 34 R18-54 48
2 25-54 33 25-54 23
3 K18-54 29 K18-54 72
4 18-34 27 18-34 47
Once you have this, you can just make a scatterplot:
sns.scatterplot(x='pred_iqr', y='pred_median', data=df)
The barplot requires a bit of pivoting, but should be:
sns.barplot(x='cntx_iqr', y='value', hue='variable',
            data=df.melt(id_vars='cntx_iqr', value_vars=['pred_iqr', 'pred_median']))
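The melt step can be verified on its own before handing the long-form frame to seaborn (a small made-up frame using the answer's column names):

```python
import pandas as pd

df = pd.DataFrame({'cntx_iqr': ['18-54', 'R18-54'],
                   'pred_iqr': [83, 34],
                   'pred_median': [47, 48]})
# One row per (group, measure) pair -- the long-form shape sns.barplot expects.
long_df = df.melt(id_vars='cntx_iqr', value_vars=['pred_iqr', 'pred_median'])
```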

Create a pandas dataframe from dictionary whilst maintaining order of columns

When creating a dataframe as below, the order of the columns changes from "Day, Visitors, Bounce Rate" to "Bounce Rate, Day, Visitors".
import pandas as pd
web_stats = {'Day': [1, 2, 3, 4, 5, 6],
             'Visitors': [43, 34, 65, 56, 29, 76],
             'Bounce Rate': [65, 67, 78, 65, 45, 52]}
df = pd.DataFrame(web_stats)
Gives:
Bounce Rate Day Visitors
0 65 1 43
1 67 2 34
2 78 3 65
3 65 4 56
4 45 5 29
5 52 6 76
How can the order be kept intact (i.e. Day, Visitors, Bounce Rate)?
One approach is to use the columns parameter.
Ex:
import pandas as pd
web_stats = {'Day': [1, 2, 3, 4, 5, 6],
             'Visitors': [43, 34, 65, 56, 29, 76],
             'Bounce Rate': [65, 67, 78, 65, 45, 52]}
df = pd.DataFrame(web_stats, columns = ['Day', 'Visitors', 'Bounce Rate'])
print(df)
Output:
Day Visitors Bounce Rate
0 1 43 65
1 2 34 67
2 3 65 78
3 4 56 65
4 5 29 45
5 6 76 52
Dictionaries are not considered to be ordered in Python < 3.7 (from 3.7 onwards, insertion order is guaranteed).
You can use collections.OrderedDict instead:
from collections import OrderedDict
web_stats = OrderedDict([('Day', [1, 2, 3, 4, 5, 6]),
                         ('Visitors', [43, 34, 65, 56, 29, 76]),
                         ('Bounce Rate', [65, 67, 78, 65, 45, 52])])
df = pd.DataFrame(web_stats)
If you don't want to write out the column names, which becomes really inconvenient if you have many keys, you may use:
df = pd.DataFrame(web_stats, columns = web_stats.keys())
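For what it's worth, on Python 3.7+ plain dicts preserve insertion order, so the original code already keeps the columns in the written order; a quick sketch:

```python
import pandas as pd

web_stats = {'Day': [1, 2], 'Visitors': [43, 34], 'Bounce Rate': [65, 67]}
df = pd.DataFrame(web_stats)  # dict insertion order preserved on Python 3.7+
```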

Pandas sort() ignoring negative sign

I want to sort a pandas df but I'm having problems with the negative values.
import pandas as pd
df = pd.read_csv('File.txt', sep='\t', header=None)
#Suppress scientific notation (finally)
pd.set_option('display.float_format', lambda x: '%.8f' % x)
print(df)
print(df.dtypes)
print(df.shape)
b = df.sort(axis=0, ascending=True)
print(b)
This gives me the ascending order but completely disregards the sign.
SPATA1 -0.00000005
HMBOX1 0.00000005
SLC38A11 -0.00000005
RP11-571M6.17 0.00000004
GNRH1 -0.00000004
PCDHB8 -0.00000004
CXCL1 0.00000004
RP11-48B3.3 -0.00000004
RNFT2 -0.00000004
GRIK3 -0.00000004
ZNF483 0.00000004
RP11-627G18.1 0.00000003
Any ideas what I'm doing wrong?
Thanks
Loading your file with:
df = pd.read_csv('File.txt', sep='\t', header=None)
Since sort(...) is deprecated (and removed in pandas 0.20+), you can use sort_values:
b = df.sort_values(by=[1], axis=0, ascending=True)
where [1] is your column of values. For me this returns:
0 1
0 ACTA1 -0.582570
1 MT-CO1 -0.543877
2 CKM -0.338265
3 MT-ND1 -0.306239
5 MT-CYB -0.128241
6 PDK4 -0.119309
8 GAPDH -0.090912
9 MYH1 -0.087777
12 RP5-940J5.9 -0.074280
13 MYH2 -0.072261
16 MT-ND2 -0.052551
18 MYL1 -0.049142
19 DES -0.048289
20 ALDOA -0.047661
22 ENO3 -0.046251
23 MT-CO2 -0.043684
26 RP11-799N11.1 -0.034972
28 TNNT3 -0.032226
29 MYBPC2 -0.030861
32 TNNI2 -0.026707
33 KLHL41 -0.026669
34 SOD2 -0.026166
35 GLUL -0.026122
42 TRIM63 -0.022971
47 FLNC -0.018180
48 ATP2A1 -0.017752
49 PYGM -0.016934
55 hsa-mir-6723 -0.015859
56 MT1A -0.015110
57 LDHA -0.014955
.. ... ...
60 RP1-178F15.4 0.013383
58 HSPB1 0.014894
54 UBB 0.015874
53 MIR1282 0.016318
52 ALDH2 0.016441
51 FTL 0.016543
50 RP11-317J10.2 0.016799
46 RP11-290D2.6 0.018803
45 RRAD 0.019449
44 MYF6 0.019954
43 STAC3 0.021931
41 RP11-138I1.4 0.023031
40 MYBPC1 0.024407
39 PDLIM3 0.025442
38 ANKRD1 0.025458
37 FTH1 0.025526
36 MT-RNR2 0.025887
31 HSPB6 0.027680
30 RP11-451G4.2 0.029969
27 AC002398.12 0.033219
25 MT-RNR1 0.040741
24 TNNC1 0.042251
21 TNNT1 0.047177
17 MT-ND3 0.051963
15 MTND1P23 0.059405
14 MB 0.063896
11 MYL2 0.076358
10 MT-ND5 0.076479
7 CA3 0.100221
4 MT-ND6 0.140729
[18152 rows x 2 columns]
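A minimal check that sort_values respects the sign (the column must be numeric; if it was read in as strings, convert it first with pd.to_numeric):

```python
import pandas as pd

# Column labels 0 and 1 mirror a read_csv(..., header=None) result.
df = pd.DataFrame({0: ['SPATA1', 'HMBOX1', 'GNRH1'],
                   1: [-0.00000005, 0.00000005, -0.00000004]})
b = df.sort_values(by=[1], ascending=True)
# Negatives sort first: -5e-08, -4e-08, then 5e-08.
```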
