using first row as heading in data frame - python

u_cols = ["Inv. Date","Customer Name","Model","Variant"]
users=pd.read_csv('5ch.xls.Sheet1.cvs', sep=',', names=u_cols)
There are times when I do not know the list of column headings. Is there any way to tell the data frame to use the first row as names?

It does so by default. The header parameter to read_csv defaults to 'infer', which behaves like header=0 (use the first row of the CSV file as the column names) unless you pass the names argument, which you have.
In your case, I don't see why you can't simply write
users=pd.read_csv('5ch.xls.Sheet1.cvs')
given that sep defaults to a comma.
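A minimal sketch of the difference, assuming a small in-memory CSV in place of the asker's file (the sample rows are invented for illustration):

```python
import io
import pandas as pd

# Hypothetical sample data standing in for the real file.
csv_text = (
    "Inv. Date,Customer Name,Model,Variant\n"
    "2020-01-05,Alice,X1,Base\n"
    "2020-01-06,Bob,X2,Sport\n"
)
u_cols = ["Inv. Date", "Customer Name", "Model", "Variant"]

# No names= argument: the first row is used as the column labels.
users = pd.read_csv(io.StringIO(csv_text))

# names= without header=0: the first row is treated as data, not as a header.
users_named = pd.read_csv(io.StringIO(csv_text), names=u_cols)

print(list(users.columns))
print(len(users), len(users_named))
```

With names passed and no header argument, the header line ends up as the first data row, which is why the second frame has one more row than the first.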

Related

I need to add headers to file , but I am losing first row of data after using `df.columns` in Pandas? [duplicate]

Using pandas, how do I read in only a subset of the columns (say 4th and 7th columns) of a .csv file with no headers? I cannot seem to be able to do so using usecols.
To read a CSV that has no header and keep only certain columns, pass header=None and usecols=[3,6] for the 4th and 7th columns:
df = pd.read_csv(file_path, header=None, usecols=[3,6])
See the docs
Previous answers were good and correct, but in my opinion, an extra names parameter will make it perfect, and it should be the recommended way, especially when the csv has no headers.
Solution
Use usecols and names parameters
df = pd.read_csv(file_path, usecols=[3,6], names=['colA', 'colB'])
Additional reading
Or add header=None to make it explicit to readers that the CSV has no headers (both lines behave identically):
df = pd.read_csv(file_path, usecols=[3,6], names=['colA', 'colB'], header=None)
So that you can retrieve your data by
# with `names` parameter
df['colA']
df['colB']
instead of
# without `names` parameter (columns keep their original positions as labels)
df[3]
df[6]
Explain
According to the read_csv documentation, when names is passed explicitly, header behaves like None instead of 0, so you can omit header=None whenever names is given.
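A small sketch of that equivalence, assuming a made-up headerless file with seven columns held in memory:

```python
import io
import pandas as pd

# Hypothetical headerless data: two rows, seven columns.
raw = "a,b,c,d,e,f,g\nh,i,j,k,l,m,n\n"

# names given, header left at its default.
implicit = pd.read_csv(io.StringIO(raw), usecols=[3, 6], names=["colA", "colB"])

# names given, header=None spelled out.
explicit = pd.read_csv(
    io.StringIO(raw), usecols=[3, 6], names=["colA", "colB"], header=None
)

# No names: the selected columns are labelled by their original positions.
unnamed = pd.read_csv(io.StringIO(raw), usecols=[3, 6], header=None)

print(implicit.equals(explicit))
print(list(unnamed.columns))
```

The first two calls should give the same frame, while the third keeps the integer positions 3 and 6 as column labels.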
Make sure you pass header=None and add usecols=[3,6] for the 4th and 7th columns.
As per documentation https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html :
header : int, list of int, default ‘infer’
Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
names : array-like, optional
List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
columts = ['Day', 'PLMN', 'RNCname']
tempo = pd.read_csv("info.csv", sep=';', header=0, names=columts, index_col=False)
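To illustrate why header=0 matters when the file already has a header row, here is a sketch with an invented semicolon-separated sample (the contents of info.csv are not shown in the original, so this data is an assumption):

```python
import io
import pandas as pd

# Hypothetical file content with its own header line.
raw = "d;p;r\n1;A;X\n2;B;Y\n"
columns = ["Day", "PLMN", "RNCname"]

# header=0 plus names: the existing header row is discarded and replaced.
replaced = pd.read_csv(
    io.StringIO(raw), sep=";", header=0, names=columns, index_col=False
)

# names alone: the existing header row is kept as the first data row.
kept_as_data = pd.read_csv(io.StringIO(raw), sep=";", names=columns)

print(replaced)
print(kept_as_data)
```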
You can also call read_table() with header=None (to read the first row of the file as the first row of the data):
df = pd.read_table('test.tsv', sep=',', usecols=[3,6], header=None)
This function is more useful if the separator is \t (.tsv file etc.) because the default delimiter is \t (unlike read_csv whose default delimiter is ,).
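As a quick check of the default separators, the following sketch reads the same invented tab-separated text with both functions:

```python
import io
import pandas as pd

# Hypothetical tab-separated content with no header row.
tsv = "a\tb\tc\n1\t2\t3\n"

# read_table splits on tabs by default.
from_table = pd.read_table(io.StringIO(tsv), header=None)

# read_csv needs the separator spelled out to get the same result.
from_csv = pd.read_csv(io.StringIO(tsv), sep="\t", header=None)

print(from_table.shape)
print(from_table.equals(from_csv))
```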

Remove rows containing blank space in python data frame

I imported a csv file to Python (Using Python data frame) and there are some missing values in a CSV file. In the data frame I have rows like following
> 08,63.40,86.21,63.12,72.78,,
I have tried everything to remove the rows containing the elements similar to the last element in the above data. Nothing works. I do not know if above is categorized as white space or empty string or what.
Here is what I have:
result = pandas.read_csv(file,sep='delimiter')
result[result!=',,']
This did not work. Then I have done following:
result.replace(' ', np.nan, inplace=True)
result.dropna(inplace=True)
This also did not work.
result = result.replace(r'\s+', np.nan, regex=True)
This also did not work. I still see the row containing the ,, element.
Also my dataframe is 100 by 1. When I import it from CSV file all the columns become 1.( I do not know if this helps)
Can anyone tell me how to remove rows containing ,, elements?
Also my dataframe is 100 by 1. When I import it from CSV file all the columns become 1
This is probably the key and IMHO is weird. When you import a csv into a pandas DataFrame you normally want each field to go in its own column, precisely so that you can later process each column's values individually. So (still IMHO) the correct solution is to fix that.
Now, to answer your question directly (it is probably an XY problem): you do not want to remove rows containing blank or empty columns, because each row holds only a single column. You want to remove rows whose text contains consecutive commas (,,). So you should use:
df = df[~df.iloc[:, 0].str.contains(',,', na=False)]
I think your code should work with a minor change:
result.replace('', np.nan, inplace=True)
result.dropna(inplace=True)
In case you have several rows in your CSV file, you can avoid the extra conversion step to NaN:
result = pandas.read_csv(file)
result = result[result.notnull().all(axis = 1)]
This will remove any row where there is an empty element.
However, your added comment explains that there is just one row in the CSV file, and it seems that the CSV reader shows some special behavior. Since you need to select the columns without NaN, I suggest these lines:
result = pandas.read_csv(file, header = None)
selected_columns = result.columns[result.notnull().any()]
result = result[selected_columns]
Note the option header = None with read_csv.
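Both approaches from the answers above can be sketched on a tiny invented sample (the first data row is made up; the second is the row from the question):

```python
import io
import pandas as pd

# Hypothetical two-line file; the second line has two empty trailing fields.
raw = "01,61.20,85.10,62.00,70.15,1.5,2.5\n08,63.40,86.21,63.12,72.78,,\n"

# Parsed with the default comma separator, empty fields become NaN,
# so dropna() removes the incomplete row.
result = pd.read_csv(io.StringIO(raw), header=None)
cleaned = result.dropna()

# If every line ended up in a single text column, filter on the raw text.
single = pd.DataFrame({0: raw.splitlines()})
kept = single[~single.iloc[:, 0].str.contains(",,", na=False)]

print(result.shape, cleaned.shape)
print(kept)
```

Parsing into separate columns first is usually the more convenient route, because the missing values are then ordinary NaN entries.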

read_table pandas python numeric error

I am doing a basic pd.read_table of a .txt file. The first column is a list of cusips. The cusip "65248E10" is being read as a number 65248E10 = 652480000000000 (E10 as scientific notation).
I have been going through the pandas documentation but I can't figure out how to force the column to stay as text. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.parsers.read_table.html#pandas.io.parsers.read_table
Also, even if I put header = 0, it seems to be putting the first row as the headers and then row 0 is the second row and so on. If my text file has no column names, how can I get that to default to NULL (or 1, 2, 3, etc.)
Thanks for the help. I am new to pandas/python
If we have a data file which looks like
65248E10 11
55555E55 22
then we can read it in with something like
>>> pd.read_table("cusip.txt", header=None, delimiter=" ", converters={0: str})
          0   1
0  65248E10  11
1  55555E55  22
where we use header=None to tell it that there aren't any headers, delimiter=" " to tell it there's a space delimiter (adjust to match your data format), and converters={0: str} to tell it that the first column, once read in as text, should be passed through str (i.e. in this case left unchanged) rather than processed further. Instead of converters={0: str}, dtype={0: str} would work too; either way pandas still infers the types of the other columns.
The problem with using header=0 is that 0 here doesn't mean "no header", it means use row number #0 (the first row) as the headers.
To stop your column from being read as a number, use the converters parameter and specify str as the converter for the column containing your "cusips".
For the header, as documented on the page you linked to, header is the number of the row which is to be considered the header; it is not a boolean saying "do I have a header or not". Setting it to zero means to use row zero (i.e., the first row) as the header. The documentation explicitly says:
Specify None if there is no header row.
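The two points can be seen side by side in a short sketch, assuming the two-line sample from the answer above is held in memory rather than in cusip.txt:

```python
import io
import pandas as pd

# The sample data from the answer, as an in-memory string.
raw = "65248E10 11\n55555E55 22\n"

# Default type inference: the first column is parsed as floats
# because values like 65248E10 look like scientific notation.
inferred = pd.read_table(io.StringIO(raw), header=None, delimiter=" ")

# Forcing column 0 to str keeps the identifiers intact;
# header=None gives integer column labels 0 and 1.
as_text = pd.read_table(
    io.StringIO(raw), header=None, delimiter=" ", dtype={0: str}
)

print(inferred[0].dtype)
print(as_text[0].tolist())
```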
