How to read a column header with a line break into Pandas? - python

I have a csv column heading:
"Submission S
tatus"
csv headers:
Unit,Publication ID,Title,"Submission S
tatus",Notes,Name,User ID
How can I refer to this when reading into the dataframe with the usecols parameter (or alternatively when renaming at a later stage)?
I have tried:
df = pd.read_csv('myfile.csv', usecols = ['Submission S\ntatus'])
error: Usecols do not match columns, columns expected but not found
df = pd.read_csv('myfile.csv', usecols = ['Submission S\rtatus'])
error: Usecols do not match columns, columns expected but not found
df = pd.read_csv('myfile.csv', usecols = ['Submission S
tatus'])
error: SyntaxError: EOL while scanning string literal
How should I be referring to this column?

This may not be the answer you wanted, but I hope it helps as a workaround: select the column by position and rename it afterwards.
# n is the 0-based position of the column in the file
df = pd.read_csv('myfile.csv', usecols=[n])
df.rename(columns={df.columns[0]: "new column name"}, inplace=True)
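If you would rather keep using usecols with the literal header, a small sketch (my own suggestion, not part of the answer above): read just the header row first and inspect the exact strings pandas sees, since the line break may be '\n' or '\r\n' depending on how the file was written.
import pandas as pd
# read only the header row to discover the exact column names pandas sees
cols = pd.read_csv('myfile.csv', nrows=0).columns
print([repr(c) for c in cols])  # reveals any embedded '\n' or '\r\n'
# select the broken header without having to type the line break yourself
df = pd.read_csv('myfile.csv',
                 usecols=[c for c in cols if c.startswith('Submission')])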

You can read a csv file in the usual way:
import pandas as pd
df = pd.read_csv(csv_file)
saved_column = df.column_name
You can save the column names with
colnames = df.columns
and later rename the problematic column to something more meaningful.
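A related sketch (my own addition, not part of the answer above): since the line break itself is the problem, you can strip it out of every header right after reading and then refer to the column normally.
import pandas as pd
df = pd.read_csv('myfile.csv')
# remove any embedded line breaks from the headers
df.columns = [c.replace('\r', '').replace('\n', '') for c in df.columns]
# the column can now be referenced, e.g. df['Submission Status']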

Related

When trying to drop a column from my dataset using pandas, I get the error "['churn'] not found in axis"

I want x to be all the columns except the "churn" column.
But when I do the below I get the "['churn'] not found in axis" error, even though I can see the column name when I write print(list(df.columns)).
Here is my code:
import pandas as pd
import numpy as np
df = pd.read_csv("/Users/utkusenel/Documents/Data Analyzing/data.csv", header=0)
print(df.head())
print(df.columns)
print(len(df.columns))
x = df.drop(["churn"], axis=1)  ## this is the part that gives the error
I am adding a snippet of my dataset as well:
account_length;area_code;international_plan;voice_mail_plan;number_vmail_messages;total_day_minutes;total_day_calls;total_day_charge;total_eve_minutes;total_eve_calls;total_eve_charge;total_night_minutes;total_night_calls;total_night_charge;total_intl_minutes;total_intl_calls;total_intl_charge;number_customer_service_calls;churn;
1;KS;128;area_code_415;no;yes;25;265.1;110;45.07;197.4;99;16.78;244.7;91;11.01;10;3;2.7;1;no
2;OH;107;area_code_415;no;yes;26;161.6;123;27.47;195.5;103;16.62;254.4;103;11.45;13.7;3;3.7;1;no
3;NJ;137;area_code_415;no;no;0;243.4;114;41.38;121.2;110;10.3;162.6;104;7.32;12.2;5;3.29;0;no
I see that your df snippet is separated with ';' (semicolon). If that is what your actual data looks like, then your csv is probably being read wrong. Please try adding sep=';' to the read_csv call.
df = pd.read_csv("/Users/utkusenel/Documents/Data Analyzing/data.csv", header=0, sep=';')
I also suggest printing df.columns again and checking whether there is leading or trailing whitespace in the column name for churn.
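A short sketch combining both suggestions (the path and column name are taken from the question):
import pandas as pd
# read with the correct separator, then normalise header whitespace
df = pd.read_csv("/Users/utkusenel/Documents/Data Analyzing/data.csv", header=0, sep=';')
df.columns = df.columns.str.strip()   # remove stray leading/trailing spaces
x = df.drop(["churn"], axis=1)        # should no longer raise the error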

Drop columns that contain certain strings while reading data : python

I'm reading .txt files in a directory and want to drop columns that contain certain strings.
for file in glob.iglob(files + '.txt', recursive=True):
    cols = list(pd.read_csv(file, nrows=1))
    df = pd.read_csv(file, header=0, skiprows=0, skipfooter=0,
                     usecols=[i for i in cols if i.str.contains['TRIVIAL|EASY'] == False])
When I do this I'm getting
df = pd.read_csv(file, header=0, skiprows=0, skipfooter=0, usecols=[i for i in cols if i.str.contains['PASS'] == True])
AttributeError: 'str' object has no attribute 'str'
Which part do I need to fix? I could not figure it out.
select columns based on columns names containing a specific string in pandas
drop column based on a string condition
AttributeError: 'str' object has no attribute 'str'
Drop multiple columns that end with certain string in Pandas
Without reading the header separately, you can pass a callable to usecols that checks whether 'EASY' or 'TRIVIAL' appears in the column name.
exclu = ['EASY', 'TRIVIAL'] # Any substring in this list excludes a column
usecols = lambda x: not any(substr in x for substr in exclu)
df = pd.read_csv('test.csv', usecols=usecols)
print(df)
HARD MEDIUM
0 2 4
1 6 8
2 1 1
Sample Data: test.csv
TRIVIAL,HARD,EASYfoo,MEDIUM
1,2,3,4
5,6,7,8
1,1,1,1
There are a few issues in your code: each i in the list comprehension is a plain Python string, which has no .str accessor (that only exists on pandas Series/Index objects), and str.contains is a method, so it must be called with parentheses rather than indexed with square brackets. Using regex instead:
import re
cols = pd.read_csv(file, nrows=1)
cols_to_use = [i for i in cols.columns if not re.search('TRIVIAL|EASY', i)]
df = pd.read_csv(file, header=0, skiprows=0, skipfooter=0, usecols=cols_to_use)

Reading multi header excel sheet in pandas

I have a multi-header excel sheet without any index column. When I read the excel in pandas, it treats the first column as an index. I want pandas to create an index instead of treating the 1st column as the index. Any help would be appreciated.
I tried below code:
df = pd.read_excel(file, header=[1,2], sheetname= "Ratings Inputs", parse_cols ="A:AA", index_col=None)
From my tests, read_excel seems broken with a multi-line header: when index_col is absent or None, it behaves as if it were 0.
You have 2 possible workarounds here:
reset_index as suggested by #mounaim:
df = pd.read_excel(file, header=[1,2], sheetname= "Ratings Inputs",
parse_cols ="A:AA", index_col=None).reset_index()
It is almost correct, except that the headers of the first column are used as the names of the MultiIndex df.columns, and the first column is named ('index', ''). So you must re-create the columns:
df.columns = pd.MultiIndex.from_tuples([tuple(df.columns.names)]
                                        + list(df.columns)[1:])
Read the headers separately:
head = pd.read_excel('3x3.xlsx', header=None, sheetname= "Ratings Inputs",
parse_cols ="A:AA", skiprows=1, nrows=2)
df = pd.read_excel(file, header=2, sheetname= "Ratings Inputs",
parse_cols ="A:AA", index_col=None).reset_index()
df.columns = pd.MultiIndex.from_tuples(list(head.transpose().to_records(index=False)))
Have you tried reset_index()?
your_data_frame.reset_index(drop=True,inplace=True)
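Note that the read_excel snippets above use older keyword names; on current pandas versions the equivalent call would look roughly like this (sheet_name replaced sheetname, and usecols replaced parse_cols), reusing the same file variable as above:
import pandas as pd
# same read as above, written with the newer keyword names
df = pd.read_excel(file, header=[1, 2], sheet_name="Ratings Inputs",
                   usecols="A:AA", index_col=None)
df = df.reset_index(drop=True)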

Read dataframe in pandas skipping first column to read time series data

The question is quite self-explanatory: is there any way to read the csv file's time series data while skipping the first column?
I tried this code:
df = pd.read_csv("occupancyrates.csv", delimiter = ',')
df = df[:,1:]
print(df)
But this is throwing an error:
"TypeError: unhashable type: 'slice'"
If you know the name of the column just do:
df = pd.read_csv("occupancyrates.csv") # no need to use the delimiter = ','
df = df.drop(['your_column_to_drop'], axis=1)
print(df)
df = pd.read_csv("occupancyrates.csv")
df.pop('column_name')
A dataframe is like a dictionary, where the column names are the keys and the columns are the values. For example:
d = dict(a=1,b=2)
d.pop('a')
Now if you print d, the output will be
{'b': 2}
This is what I did above to remove a column from the data frame. This way you do not need to assign the result back to the dataframe, unlike the other answer(s).
df = df.iloc[:, 1:]
With this approach you don't even need to specify inplace=True anywhere.
The simplest way to delete the first column should be:
del df[df.columns[0]]
or
df.pop(df.columns[0])
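If you would rather skip the column at read time instead of dropping it afterwards, a small sketch (reusing the file name from the question): read just the header row first, then pass everything except the first name to usecols.
import pandas as pd
# read only the header row to learn the column names
header_cols = pd.read_csv("occupancyrates.csv", nrows=0).columns
# then read the data, keeping every column except the first
df = pd.read_csv("occupancyrates.csv", usecols=header_cols[1:])
print(df)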

pandas add columns when read from a csv file

I want to read from a CSV file using pandas read_csv. The CSV file doesn't have column names. When I use pandas to read it, the first data row is used as the column names by default. And when I then assign df.columns = ['ID', 'CODE'], that first row is lost. I want to add column names, not replace the first row of data.
df = pd.read_csv(CSV)
df
a 55000G707270
0 b 5l0000D35270
1 c 5l0000D63630
2 d 5l0000G45630
3 e 5l000G191200
4 f 55000G703240
df.columns=['ID','CODE']
df
ID CODE
0 b 5l0000D35270
1 c 5l0000D63630
2 d 5l0000G45630
3 e 5l000G191200
4 f 55000G703240
I think you need the names parameter in read_csv:
df = pd.read_csv(CSV, names=['ID','CODE'])
names : array-like, default None
List of column names to use. If file contains no header row, then you should explicitly pass header=None. Duplicates in this list are not allowed unless mangle_dupe_cols=True, which is the default.
You may pass the column names at the time of reading the csv file itself as :
df = pd.read_csv(csv_path, names = ["ID", "CODE"])
Use names argument in function call to add the columns yourself:
df = pd.read_csv(CSV, names=['ID','CODE'])
You need both header=None and names=['ID','CODE'], because there are no column names/labels/headers in your CSV file:
df = pd.read_csv(CSV, header=None, names=['ID','CODE'])
The reason an extra index column gets added is that to_csv() writes the index by default, so you can either disable the index when saving your CSV:
df.to_csv('file.csv', index=False)
or you can specify an index column when reading:
df = pd.read_csv('file.csv', index_col=0)
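Putting these pieces together, a minimal round-trip sketch (the file name 'file.csv' and the sample values are just for illustration):
import pandas as pd
df = pd.DataFrame({'ID': ['a', 'b'], 'CODE': ['55000G707270', '5l0000D35270']})
# write without the index so no extra index column appears on re-read
df.to_csv('file.csv', index=False, header=False)
# read it back, supplying the column names since the file has no header row
df2 = pd.read_csv('file.csv', header=None, names=['ID', 'CODE'])
print(df2)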
