I am doing a basic pd.read_table of a .txt file. The first column is a list of cusips. The cusip "65248E10" is being read as the number 65248E10 = 652480000000000 (the E10 is interpreted as scientific notation).
I have been going through the pandas documentation but I can't figure out how to force it to stay as a string. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.parsers.read_table.html#pandas.io.parsers.read_table
Also, even if I put header=0, it seems to take the first row as the headers, so row 0 becomes the second row and so on. If my text file has no column names, how can I get that to default to NULL (or 1, 2, 3, etc.)?
Thanks for the help. I am new to pandas/python
If we have a data file which looks like
65248E10 11
55555E55 22
then we can read it in with something like
>>> pd.read_table("cusip.txt", header=None, delimiter=" ", converters={0: str})
0 1
0 65248E10 11
1 55555E55 22
where header=None tells it that there aren't any headers, delimiter=" " tells it the fields are space-separated (adjust to match your data format), and converters={0: str} tells it to pass the first column through str after reading it, i.e. leave it as a string rather than parse it as a number. Passing dtype={0: str} instead of converters={0: str} would have worked too; either way pandas is still left to figure out the types of the other columns on its own.
The problem with using header=0 is that 0 here doesn't mean "no header", it means use row number #0 (the first row) as the headers.
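For completeness, here is a minimal sketch of the dtype alternative mentioned above, assuming the same space-delimited two-column cusip.txt file:
import pandas as pd

# Force column 0 to be read as a string and let pandas infer the second column.
df = pd.read_table("cusip.txt", header=None, delimiter=" ", dtype={0: str})
print(df.dtypes)  # column 0 -> object (strings), column 1 -> int64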
To stop your column from being read as a number, use the converters parameter and specify str as the converter for the column containing your "cusips".
For the header, as documented on the page you linked to, header is the number of the row which is to be considered the header; it is not a boolean saying "do I have a header or not". Setting it to zero means to use row zero (i.e., the first row) as the header. The documentation explicitly says:
Specify None if there is no header row.
In the pandas docs, for the function read_csv, I'm trying to understand what the following explanation about the behavior of the function is when index_col is set to its default value, None:
The default value of None instructs pandas to guess. If the number of fields in the column header row is equal to the number of fields in the body of the data file, then a default index is used. If it is larger, then the first columns are used as index so that the remaining number of fields in the body are equal to the number of fields in the header.
The first row after the header is used to determine the number of columns, which will go into the index. If the subsequent rows contain less columns than the first row, they are filled with NaN.
So I came up with the following toy example:
with open("io_tools_example_index.txt", "w") as f:
f.write("pandas, koalas, lizards, kangaroos\n1,2,3\n4,5,6")
When I do pd.read_csv("io_tools_example_index.txt"), I get a DataFrame with a default integer index (and NaN in the last column), whereas based on their explanation I would have expected pandas to use the pandas column as the index, since the number of fields in the column header is larger than the number of fields in the remaining lines. What am I missing here?
It's ambiguous, but the "it" in "If it is larger" refers to the number of fields in the body of the data file rather than the number of column header fields. If you had a CSV file named foo2.csv with the contents
pandas, koalas
1,2,3
4,5,6
then "1" and "4" would be used as the indices of the rows in the body, so running pd.read_csv("foo2.csv") would get you this:
   pandas  koalas
1       2       3
4       5       6
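If you want to reproduce this yourself, here is a small self-contained version (the file name foo2.csv is just the example name used above):
import pandas as pd

# The body rows have one more field than the header row, so pandas uses the
# first body column as the index.
with open("foo2.csv", "w") as f:
    f.write("pandas, koalas\n1,2,3\n4,5,6")

df = pd.read_csv("foo2.csv")
print(df.index.tolist())  # [1, 4] -- the extra leading field became the index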
Using pandas, how do I read in only a subset of the columns (say 4th and 7th columns) of a .csv file with no headers? I cannot seem to be able to do so using usecols.
In order to read in a csv that doesn't have a header, and only certain columns, you need to pass header=None and usecols=[3,6] for the 4th and 7th columns:
df = pd.read_csv(file_path, header=None, usecols=[3,6])
See the docs
The previous answers are good and correct, but in my opinion an extra names parameter makes this cleaner, and it should be the recommended way, especially when the csv has no headers.
Solution
Use usecols and names parameters
df = pd.read_csv(file_path, usecols=[3,6], names=['colA', 'colB'])
Or add header=None to tell readers explicitly that the csv has no headers (either way, both lines behave identically):
df = pd.read_csv(file_path, usecols=[3,6], names=['colA', 'colB'], header=None)
So that you can retrieve your data by
# with `names` parameter
df['colA']
df['colB']
instead of
# without `names` parameter
df[0]
df[1]
Explanation
According to the read_csv documentation, when names is passed explicitly, header behaves like None instead of 0, so you can skip header=None whenever names is given.
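A quick way to convince yourself of that equivalence (file_path is still a placeholder for your own file):
import pandas as pd

# Both calls treat the first line as data, not as a header, because names is given.
df1 = pd.read_csv(file_path, usecols=[3, 6], names=['colA', 'colB'])
df2 = pd.read_csv(file_path, usecols=[3, 6], names=['colA', 'colB'], header=None)
assert df1.equals(df2)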
Make sure you pass header=None and add usecols=[3,6] for the 4th and 7th columns.
As per the documentation at https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html :
header : int, list of int, default ‘infer’
Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
names : array-like, optional
List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
# The file already has a header row, so header=0 lets names replace the existing names:
columns = ['Day', 'PLMN', 'RNCname']
tempo = pd.read_csv("info.csv", sep=';', header=0, names=columns, index_col=False)
You can also call read_table() with header=None (to read the first row of the file as the first row of the data):
df = pd.read_table('test.tsv', sep=',', usecols=[3,6], header=None)
This function is more convenient when the separator is \t (.tsv files etc.), because its default delimiter is \t (unlike read_csv, whose default delimiter is ,).
I read in excel files that are normally formatted like this below:
colA colB
0 0
1 1
and I can just write something like df = pd.read_excel(filename, skiprows=0)
which skips the column headers and ingests the data. However sometimes my data comes in as
some random text in the cells above
colA colB
0 0
1 1
where I would need to delete that extra row manually and then shift everything up so that the first row is made up of the column headers. Is there an elegant way to start the Excel read at whatever row colA is found in, so we skip any unnecessary entries or text above the colA and colB headers?
Assuming you know the first column name (i.e. colA in your example), and that this value will be present somewhere in the first column of data:
if df.columns[0] != "colA": # Check first if column name is incorrect.
# Get the first column of data:
first_col = df[df.columns[0]]
# Identify the row index where the value equals the column name:
header_row_index = first_col.loc[first_col == "colA"].index[0]
# Grab the column names:
column_names = df.loc[header_row_index]
# Reset the df to start below the new header row, and rename the columns:
df = df.loc[header_row_index+1:, :]
df.columns = column_names
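Wrapped up as a reusable helper (a sketch only; the function name and "report.xlsx" are placeholders, and colA is assumed to be the first real header), the same idea looks like this:
import pandas as pd

def read_excel_from_header(path, first_col="colA"):
    df = pd.read_excel(path)
    if df.columns[0] != first_col:
        first = df[df.columns[0]]
        # Row index where the first real header value appears in the data:
        header_row = first.loc[first == first_col].index[0]
        column_names = df.loc[header_row]
        # Keep only the rows below the header row and promote it to column names:
        df = df.loc[header_row + 1:, :]
        df.columns = column_names
        df = df.reset_index(drop=True)
    return df

df = read_excel_from_header("report.xlsx")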
I don't quite understand your problem. It looks like you already know about skiprows.
You could just pass a list of row numbers to do that.
skiprows : list-like, int or callable, optional
Line numbers to skip (0-indexed) or number of lines to skip (int)
at the start of the file.
For example,
rows_to_skip=[0,1,2] #skip first 3 rows of the file
df = pd.read_excel(filename, skiprows=rows_to_skip)
There is also a way to slightly simplify the process. Say you don't know the exact line where your column headers are. You can use grep in a terminal to obtain this number and then just get rid of all rows before it.
For example, grep -n 'colA' filename will print the matching line along with its (1-based) line number. You could then easily build a list to skip all preceding lines, like rows_to_skip = list(range(line_number - 1)) (the - 1 accounts for grep counting lines from 1 while skiprows counts from 0). Not the best possible solution memory-wise (because of the list), but it should also work here.
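If you would rather stay inside Python (and this also works for Excel files, which grep cannot search), one option is to read the sheet once with header=None, locate the row that holds colA, and re-read with skiprows. A rough sketch, with "messy.xlsx" as a placeholder file name:
import pandas as pd

# First pass: no header, so every line is data and we can search it.
raw = pd.read_excel("messy.xlsx", header=None)
header_line = raw[raw[0] == "colA"].index[0]  # 0-indexed row of the real header

# Second pass: skip everything above the header row; colA/colB become the columns.
df = pd.read_excel("messy.xlsx", skiprows=range(header_line))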
I imported a CSV file into Python (using a pandas DataFrame) and there are some missing values in the CSV file. In the data frame I have rows like the following:
08,63.40,86.21,63.12,72.78,,
I have tried everything to remove the rows containing elements like the trailing ones in the data above. Nothing works. I do not know if the above is categorized as whitespace, an empty string, or something else.
Here is what I have:
result = pandas.read_csv(file,sep='delimiter')
result[result!=',,']
This did not work. Then I have done following:
result.replace(' ', np.nan, inplace=True)
result.dropna(inplace=True)
This also did not work.
result = result.replace(r'\s+', np.nan, regex=True)
This also did not work. I still see the row containing the ,, element.
Also, my dataframe is 100 by 1. When I import it from the CSV file all the columns become 1. (I do not know if this helps.)
Can anyone tell me how to remove rows containing ,, elements?
Also my dataframe is 100 by 1. When I import it from CSV file all the columns become 1
This is probably the key, and IMHO it is weird. When you import a csv into a pandas DataFrame you normally want each field to go in its own column, precisely so that you can later process each column's values individually. So (still IMHO) the correct solution is to fix that; the likely culprit is sep='delimiter', which pandas treats as a separator string that never actually matches, so every whole line ends up in a single column. Pass the real separator instead, e.g. sep=','.
Now, to directly answer your (probably XY) question: you do not want to remove rows containing blank or empty columns, because each row only contains one single column, but rows containing consecutive commas (,,). So you could use:
df = df[~df.iloc[:, 0].str.contains(',,')]
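As a rough sketch of what "fixing that" could look like (assuming the file really is comma-separated and has no header row; file is the path variable from the question):
import pandas as pd

# With the correct separator, empty fields (the ",,") become NaN in their own
# columns, so rows with missing values can simply be dropped.
result = pd.read_csv(file, sep=',', header=None)
result = result.dropna()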
I think your code should work with a minor change:
result.replace('', np.nan, inplace=True)
result.dropna(inplace=True)
In case you have several rows in your CSV file, you can avoid the extra conversion step to NaN:
result = pandas.read_csv(file)
result = result[result.notnull().all(axis = 1)]
This will remove any row where there is an empty element.
However, your added comment explains that there is just one row in the CSV file, and it seems that the CSV reader shows some special behavior. Since you need to select the columns without NaN, I suggest these lines:
result = pandas.read_csv(file, header = None)
selected_columns = result.columns[result.notnull().any()]
result = result[selected_columns]
Note the option header = None with read_csv.
u_cols = ["Inv. Date","Customer Name","Model","Variant"]
users=pd.read_csv('5ch.xls.Sheet1.cvs', sep=',', names=u_cols)
There are times when I do not know the list of column headings. Is there any way to tell the data frame to use the first row as names?
It does so by default. The header parameter of read_csv defaults to 'infer', which behaves like header=0 (use the first row of the given CSV file as the column names) unless you pass the names argument, which you have.
In your case, I don't see why you can't simply write
users=pd.read_csv('5ch.xls.Sheet1.cvs')
given that sep defaults to a comma.