Having problems when performing calculations in a Pandas dataframe... Here's a sample CSV (see picture):
My problem is that since it reads the row in italics e.g. Data Type (row 2), it treats all values as strings instead of their correct data type i.e. float, degrees, etc. Is there a way I can get it to ignore this row when reading the CSV e.g.
df = pd.read_CSV('sample CSV', ignore row 2)
That way it will read in like this (see other picture) and assume correct data types:
You can pass a list to skiprows to skip only that row. From the docs:
skiprows : list-like or integer or callable, default None
Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
Try:
pd.read_csv('my.csv', skiprows=[1])
Beware that Python starts counting from 0, so the second row of the file is index 1.
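To make the effect of skiprows=[1] concrete, here is a minimal sketch. The CSV content is a hypothetical stand-in for the file in the question (the real file is only shown as a picture), and it is read from an in-memory buffer instead of disk:

```python
import io
import pandas as pd

# Hypothetical sample: line 0 holds the column names, line 1 holds the
# data-type labels, and the remaining lines are the actual data.
csv_text = (
    "UTC Time,Value1,Value2\n"
    "string,float,degrees\n"
    "2019-01-01 00:00:00,1.5,90\n"
    "2019-01-01 00:01:00,2.5,180\n"
)

# Without skiprows the label row is read as data, so the numeric
# columns are parsed as text instead of numbers.
raw = pd.read_csv(io.StringIO(csv_text))

# skiprows=[1] drops only file line 1 (0-indexed); line 0 still
# supplies the column names.
df = pd.read_csv(io.StringIO(csv_text), skiprows=[1])

print(raw.dtypes)
print(df.dtypes)
```

Passing a list is what matters here: skiprows=1 (an integer) would instead skip the first line of the file.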
You can also use skiprows=[0], which skips the first line of the file (the header row); for more details, refer to the documentation:
df = pd.read_csv('Your Filename', skiprows=[0])
You should use the header argument of read_csv. For example:
df = pd.read_csv('sample CSV', header=1)
This won't assign the names in the first row, because the data-type row (index 1) becomes the header. You could probably achieve this by passing the column names manually:
df = pd.read_csv('sample CSV', header=1, names=['UTC Time', 'Value1', 'Value2', 'Value3'])
You could even get the names programmatically:
with open('sample CSV', 'r') as fh:
    names = fh.readline().strip().split(',')
df = pd.read_csv('sample CSV', header=1, names=names)
(all the code is untested)
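A runnable sketch of the header-plus-names idea, under the assumption that the data-type labels sit on file line 1 (0-indexed). The sample content and the in-memory buffer are hypothetical; with this layout, header=1 tells pandas to discard everything up to and including line 1, and names supplies the labels read from line 0:

```python
import io
import pandas as pd

# Hypothetical sample with a names row, a data-type row, then data.
csv_text = (
    "UTC Time,Value1,Value2\n"
    "string,float,degrees\n"
    "2019-01-01 00:00:00,1.5,90\n"
    "2019-01-01 00:01:00,2.5,180\n"
)

buffer = io.StringIO(csv_text)

# Read the column names from the first line, dropping the trailing newline.
names = buffer.readline().strip().split(",")

# Rewind so read_csv sees the whole file again.
buffer.seek(0)

# header=1 consumes lines 0 and 1; names replaces the labels found there.
df = pd.read_csv(buffer, header=1, names=names)

print(df.dtypes)
```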
Using pandas, how do I read in only a subset of the columns (say 4th and 7th columns) of a .csv file with no headers? I cannot seem to be able to do so using usecols.
In order to read a csv in that doesn't have a header and for only certain columns you need to pass params header=None and usecols=[3,6] for the 4th and 7th columns:
df = pd.read_csv(file_path, header=None, usecols=[3,6])
See the docs
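A small sketch of this call on made-up data (the seven-column content below is an assumption for illustration). Note that the resulting columns are labelled with their original positions, 3 and 6:

```python
import io
import pandas as pd

# Hypothetical header-less CSV with seven columns.
csv_text = (
    "a0,a1,a2,a3,a4,a5,a6\n"
    "b0,b1,b2,b3,b4,b5,b6\n"
)

# header=None: the first line is data, not column names.
# usecols=[3, 6]: keep only the 4th and 7th columns (0-indexed positions).
df = pd.read_csv(io.StringIO(csv_text), header=None, usecols=[3, 6])

print(df.columns.tolist())
print(df)
```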
Previous answers were good and correct, but in my opinion, an extra names parameter will make it perfect, and it should be the recommended way, especially when the csv has no headers.
Solution
Use usecols and names parameters
df = pd.read_csv(file_path, usecols=[3,6], names=['colA', 'colB'])
Additional reading
Or pass header=None as well, to make it explicit that the CSV has no header (both lines behave identically):
df = pd.read_csv(file_path, usecols=[3,6], names=['colA', 'colB'], header=None)
So that you can retrieve your data by
# with `names` parameter
df['colA']
df['colB']
instead of
# without `names` parameter (columns keep their original positions as labels)
df[3]
df[6]
Explain
According to the read_csv documentation, when names is passed explicitly, header behaves like None instead of 0, so header=None can be omitted when names is given.
Make sure you pass header=None and add usecols=[3,6] for the 4th and 7th columns.
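The claim that header=None is redundant once names is given can be checked with a short sketch. The data is hypothetical, and colA/colB are the illustrative names used in the answer above:

```python
import io
import pandas as pd

# Hypothetical header-less CSV with seven columns.
csv_text = (
    "a0,a1,a2,a3,a4,a5,a6\n"
    "b0,b1,b2,b3,b4,b5,b6\n"
)

# names only: header is inferred as None because names is passed.
implicit = pd.read_csv(
    io.StringIO(csv_text), usecols=[3, 6], names=["colA", "colB"]
)

# names plus an explicit header=None.
explicit = pd.read_csv(
    io.StringIO(csv_text), usecols=[3, 6], names=["colA", "colB"], header=None
)

# Both calls keep the first line as data and label the two columns.
print(implicit.equals(explicit))
print(implicit["colA"].tolist())
```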
As per documentation https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html :
header : int, list of int, default 'infer'
Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
names : array-like, optional
List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
columts = ['Day', 'PLMN', 'RNCname']
tempo = pd.read_csv("info.csv", sep=';', header=0, names=columts, index_col=False)
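To see why header=0 matters when the file already has a header row, compare the two calls in this sketch. The semicolon-separated content is invented for illustration; only the column names come from the snippet above:

```python
import io
import pandas as pd

# Hypothetical file whose first line is an existing (unwanted) header.
csv_text = (
    "d;p;r\n"
    "2020-01-01;26201;RNC1\n"
    "2020-01-02;26202;RNC2\n"
)
columns = ["Day", "PLMN", "RNCname"]

# header=0 with names: the first line is treated as the header and
# its labels are replaced by the supplied names.
replaced = pd.read_csv(io.StringIO(csv_text), sep=";", header=0, names=columns)

# names alone: header behaves like None, so the old header line
# ends up as the first data row.
kept = pd.read_csv(io.StringIO(csv_text), sep=";", names=columns)

print(len(replaced), len(kept))
print(kept.iloc[0].tolist())
```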
You can also call read_table() with header=None (to read the first row of the file as the first row of the data):
df = pd.read_table('test.tsv', sep=',', usecols=[3,6], header=None)
This function is more useful if the separator is \t (.tsv file etc.) because the default delimiter is \t (unlike read_csv whose default delimiter is ,).
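As a quick check of the default-delimiter difference, this sketch reads the same tab-separated text (hypothetical content) with read_table and with read_csv plus an explicit sep:

```python
import io
import pandas as pd

# Hypothetical header-less, tab-separated content.
tsv_text = "a0\ta1\ta2\nb0\tb1\tb2\n"

# read_table splits on tabs by default.
from_table = pd.read_table(io.StringIO(tsv_text), header=None, usecols=[0, 2])

# read_csv needs sep="\t" to produce the same result.
from_csv = pd.read_csv(
    io.StringIO(tsv_text), sep="\t", header=None, usecols=[0, 2]
)

print(from_table.equals(from_csv))
```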
I have an Excel file with numbers (integers) in some rows of the first column (A) and text in all rows of the second column (B):
I want to clean this text, that is I want to remove tags like < b r > (without spaces). My current approach doesn't seem to work:
file_name = "F:\Project\comments_all_sorted.xlsx"
import pandas as pd
df = pd.read_excel(file_name, header=None, index_col=None, usecols='B') # specify that there's no header and no column for row labels, use only column B (which includes the text)
clean_df = df.replace('<br>', '')
clean_df.to_excel('output.xlsx')
What this code does (which I don't want it to do) is it adds running numbers in the first column (A), replacing also the few numbers that were already there, and it adds a first row with '1' in second column of this row (cell 1B):
I'm sure there's an easy way to solve my problem and I'm just not trained enough to see it.
Thanks!
Try this:
df['column_name'] = df['column_name'].str.replace(r'<br>', '')
The index in the output file can be turned off with index=False in the df.to_excel function, i.e,
clean_df.to_excel('output.xlsx', index=False)
DataFrame.replace matches whole cell values by default, so it won't remove a substring such as '<br>' from longer text. Use .str.replace on the column instead. In this case, I just iterate through all columns in case there are more than just the one column.
To get rid of the first column with the sequential numbers (that's the index of the dataframe), add the parameter index=False. The number 1 on the top is the column name. To get rid of that, use header=False
import pandas as pd
file_name = r"F:\Project\comments_all_sorted.xlsx"  # raw string so the backslashes are kept literally
df = pd.read_excel(file_name, header=None, index_col=None, usecols='B') # specify that there's no header and no column for row labels, use only column B (which includes the text)
clean_df = df.copy()
for col in clean_df.columns:
    clean_df[col] = df[col].str.replace('<br>', '', regex=False)  # literal match, not a regex
clean_df.to_excel('output.xlsx', index=False, header=False)
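The difference between the question's df.replace call and the .str.replace approach can be seen without an Excel file. The sketch below uses a small hand-made DataFrame (an assumption standing in for column B) and also shows df.replace with regex=True as an alternative that works on the whole frame:

```python
import pandas as pd

# Hypothetical stand-in for the text column read from the workbook.
df = pd.DataFrame({0: ["first line<br>second line", "<br>", "no tag"]})

# Default DataFrame.replace: only cells that are exactly "<br>" change.
exact = df.replace("<br>", "")

# regex=True makes DataFrame.replace substitute inside longer strings.
substring = df.replace("<br>", "", regex=True)

# Per-column string accessor, treating "<br>" as literal text.
per_column = df[0].str.replace("<br>", "", regex=False)

print(exact[0].tolist())
print(substring[0].tolist())
print(per_column.tolist())
```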
I read an Excel Sheet into a pandas DataFrame this way:
import pandas as pd
xl = pd.ExcelFile("Path + filename")
df = xl.parse("Sheet1")
The first cell's value of each column is used as the column name for the DataFrame. I want to specify my own column names. How do I do this?
This thread is 5 years old and outdated now, but it still shows up at the top of the list in a generic search, so I am adding this note. Pandas now (v0.22) has a keyword to specify column names when parsing Excel files. Use:
import pandas as pd
xl = pd.ExcelFile("Path + filename")
df = xl.parse("Sheet 1", header=None, names=['A', 'B', 'C'])
If header=None is not set, pandas seems to treat the first row as the header and drop it during parsing. If there is indeed a header but you don't want to use it, you have two choices: either (1) use the "names" kwarg only, or (2) use "names" with header=None and skiprows=1. I personally prefer the second option, since it makes clear that the input file is not in the format I want and that I am doing something to work around it.
I think setting them afterwards is the only way in this case, so if you have for example four columns in your DataFrame:
df.columns = ['W','X','Y','Z']
If you know in advance what the headers in the Excel file are, it's probably better to rename them. This would rename W to A, and so on:
df = df.rename(columns={'W': 'A', 'X': 'B'})  # add the remaining columns as needed
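A short sketch contrasting the two ways of setting names after loading. The DataFrame is built by hand here (an assumption, so that no Excel file is needed); rename maps only the listed labels, while assigning to columns replaces all of them by position:

```python
import pandas as pd

# Hypothetical frame standing in for the parsed sheet.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["W", "X", "Y"])

# rename returns a new frame; labels not in the mapping are left alone.
renamed = df.rename(columns={"W": "A", "X": "B"})

# Assigning to .columns needs one name per column, in order.
relabelled = df.copy()
relabelled.columns = ["A", "B", "C"]

print(renamed.columns.tolist())
print(relabelled.columns.tolist())
```

rename is the safer choice when the column order of the source file might change, since it matches on the existing labels instead of on position.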
As Ram said, this post comes on the top and may be useful to some....
In pandas 0.24.2 (maybe earlier as well), read_excel itself can ignore the source headers and apply your own column names, along with a few other useful controls:
DID = pd.read_excel(file1, sheet_name=0, header=None, usecols=[0, 1, 6], names=['A', 'ID', 'B'], dtype={2:str}, skiprows=10)
# for example....
# usecols => read only specific col indexes
# dtype => specifying the data types
# skiprows => skip number of rows from the top.
Call .parse with the header=None keyword argument:
df = xl.parse("Sheet1", header=None)
In case the excel sheet only contains the data without headers:
df=pd.read_excel("the excel file",header=None,names=["A","B","C"])
In case the excel sheet already contains header names, then use skiprows to skip the line:
df=pd.read_excel("the excel file",header=None,names=["A","B","C"],skiprows=1)