First row of data has become a column in Pandas table - python

The first row of my pandas data table has turned into a column. I've tried various renaming and restructuring methods and nothing has worked. It's probably something trivial, but unfortunately I need some help.
The line "0" is supposed to come down as the first data row "Bachelor". Could someone please point me to the proper way of getting this done?

I think the problem is that your CSV has no header row, so pandas can be told to create default range column names:
df_degree = pd.read_csv(file, header=None)
Or you can define custom column names:
df_degree = pd.read_csv(file, names=['col1','col2'])
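A minimal, self-contained sketch of the difference, using an in-memory CSV (the column names 'degree' and 'count' are made up for illustration):

```python
from io import StringIO
import pandas as pd

# Hypothetical two-column CSV with no header row
csv_data = "Bachelor,10\nMaster,20\nPhD,30\n"

# Without header=None, the first data row is consumed as column names
df_wrong = pd.read_csv(StringIO(csv_data))
print(df_wrong.columns.tolist())    # ['Bachelor', '10']

# header=None keeps every row as data and assigns 0, 1, ... as column names
df_default = pd.read_csv(StringIO(csv_data), header=None)
print(df_default.iloc[0, 0])        # Bachelor

# names= keeps every row as data and assigns your own column names
df_named = pd.read_csv(StringIO(csv_data), names=['degree', 'count'])
print(df_named['degree'].tolist())  # ['Bachelor', 'Master', 'PhD']
```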

Related

Pandas DataFrame 'Date' Column has no header / dtypes / not listed in df.columns

I'm importing data from nasdaqdatalink api
Two questions from this:
(1) How is this already a Pandas DataFrame without me needing to type df = pd.DataFrame ?
(2) The 'Date' column doesn't appear to be a DataFrame column? If I try df.columns it doesn't show up in the column index, and it obviously has no header. So I'm confused about what's happening here.
Essentially, I wanted to select data from this DataFrame between two dates, but the only way I really know how to do that is by selecting the column name first. However, I'm missing something here. I tried to rename the column in position [0] but that just created a new column named 'Date' with NaN values.
What am I not understanding? (I've only just begun learning Python, Pandas etc. ~1 month ago ! so this is about as far as I could go on my own without more help)
screenshot
There's actually a better way: keep Date as the index and slice by label. See the output of:
df.loc['2008-01-01':'2009-01-01']
df.reset_index() turns whatever the current index is into a column.
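A small illustration of the idea with a made-up price series (the dates and values are invented; the nasdaqdatalink output is similar in shape):

```python
import pandas as pd

# Hypothetical series with a DatetimeIndex named 'Date', like an API return
df = pd.DataFrame(
    {'Close': [100.0, 101.5, 99.8]},
    index=pd.to_datetime(['2008-06-30', '2008-12-31', '2009-06-30']),
)
df.index.name = 'Date'

# 'Date' is the index, not a column, so it is absent from df.columns
print('Date' in df.columns)  # False

# Label slicing on a sorted DatetimeIndex selects rows between two dates
subset = df.loc['2008-01-01':'2009-01-01']
print(len(subset))  # 2

# reset_index() turns the index into an ordinary 'Date' column
df2 = df.reset_index()
print('Date' in df2.columns)  # True
```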

How can I read a document with pandas (python) that doesn't look like the average one?

I am trying to get the values from the columns of a file. The doc looks like this:
the data I want to read
All the examples I have found use pd.read_csv or pd.DataFrame, but that data usually has a clear header and nothing on top of it (my file has about 10 lines at the top that I don't really need for what I am doing).
Also, I think maybe there is something wrong because I tried to run:
data = pd.read_csv('tdump_BIL_100_-96_17010100.txt',header=15)
and I get
pd.read_csv output
which is just the row in one column, so there is no separation apparently, and therefore no way of getting the columns I need.
So my question is if there is a way to get the data from this file with pandas and how to get it.
If the number of extra lines is fixed, skip the initial rows, indicate that no header is present, and specify that values are separated by whitespace:
df = pd.read_csv(filename, skiprows=15, header=None, sep=r'\s+')
See read_csv for documentation.
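A runnable sketch of the same call on an in-memory stand-in for the file (the junk lines and numbers are invented; adjust skiprows to match your file):

```python
from io import StringIO
import pandas as pd

# Hypothetical file: non-data lines on top, then whitespace-separated columns
raw = "junk line 1\njunk line 2\n1 2017 1 1 0\n2 2017 1 1 6\n"

df = pd.read_csv(
    StringIO(raw),
    skiprows=2,    # skip the non-data lines at the top
    header=None,   # the data itself carries no column names
    sep=r'\s+',    # one or more whitespace characters as the delimiter
)
print(df.shape)  # (2, 5)
```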

Merging and cleaning up csv files in Python

I have been using pandas but am open to all suggestions. I'm not an expert at scripting and am at a complete loss. My goal is the following:
Merge multiple CSV files. Was able to do this in Pandas and have a dataframe with the merged dataset.
Screenshot of how merged dataset looks like
Delete the duplicated "GEO" columns after the first set. This part doesn't let me use df = df.loc[:,~df.columns.duplicated()] because they are not technically duplicated: the repeated column names end with .1, .2, etc., which I am guessing the concat adds. Another problem is that some columns have a duplicated column name but are different datasets. I have been using the first row as the index since it always holds the same coded values, but this row is unnecessary and will be deleted later in the script. This is my biggest problem right now.
Delete certain columns such as the ones with the "Margins". I use ~df2.columns.str.startswith for this and have no trouble with this.
Replace spaces, ":" and ";" with underscores in the first row. I have no clue how to do this.
Insert a new column, write the '=TEXT(B1,0)' formula, apply it to the whole column (the formula changing to B2, B3, etc.), then copy the column and paste it as values. I was able to do this in openpyxl, although I had trouble and could not verify the final output thanks to Excel trouble.
source = excel.Workbooks.Open(filename)
excel.Range("C1:C1337").Select()
excel.Selection.Copy()
excel.Selection.PasteSpecial(Paste=constants.xlPasteValues)
Not sure if it works, and I was wondering whether this is possible in pandas or win32com, or whether I should stay with openpyxl. Thanks all!
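For steps 2-4, a pandas-only sketch may work, assuming the merge suffixed the repeated GEO columns with .1, .2, etc. (the column names below are made up for illustration):

```python
import pandas as pd

# Hypothetical merged frame: pandas suffixed the repeated GEO column with .1
df = pd.DataFrame({
    'GEO': ['a', 'b'],
    'Total: Estimate': [1, 2],
    'GEO.1': ['a', 'b'],
    'Margin of Error': [5, 6],
})

# Step 2: drop the suffixed copies of GEO (GEO.1, GEO.2, ...)
df = df.loc[:, ~df.columns.str.match(r'GEO\.\d+$')]

# Step 3: drop the "Margin" columns
df = df.loc[:, ~df.columns.str.startswith('Margin')]

# Step 4: replace spaces, ':' and ';' with underscores in the labels
df.columns = df.columns.str.replace(r'[ :;]', '_', regex=True)

print(df.columns.tolist())  # ['GEO', 'Total__Estimate']
```

For step 5, the '=TEXT(B1,0)' formula is roughly equivalent to converting the column to strings in pandas, e.g. df['C'] = df['B'].astype(str), which avoids round-tripping through Excel entirely.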

How to iterate row by row in a pandas dataframe and look for a value in its columns

I must read each row of an Excel file and perform calculations based on the contents of each row. Each row is divided into columns; my problem is that I cannot find a way to access the contents of those columns.
I'm reading the rows with:
for i in df.index:
    print(df.loc[i])
Which works well, but when I try to access, say, the 4th column with this type of indexing I get an error:
for i in df.index:
    print(df.loc[i][3])
I'm pretty sure I'm approaching the indexing issue in the wrong way, but I cannot figure out how to solve it.
You can use iterrows(), like in the following code:
for index, row in dataFrame.iterrows():
    print(row)
But this is not the most efficient way to iterate over a pandas DataFrame; more info at this post.
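To get at individual columns inside the loop: each row yielded by iterrows() is a Series, so you can index it by name or by position. A small sketch with made-up data:

```python
import pandas as pd

# Made-up frame with four columns
df = pd.DataFrame({'a': [1, 2], 'b': [10, 20], 'c': [100, 200], 'd': [7, 8]})

# iterrows() yields (index, row) pairs; each row is a Series,
# so its values are reachable by column name or by position
for index, row in df.iterrows():
    print(row['d'])     # by name
    print(row.iloc[3])  # by position: the 4th column

fourth_col = [row.iloc[3] for _, row in df.iterrows()]
print(fourth_col)  # [7, 8]
```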

Python read_csv - ParserError: Error tokenizing data

I understand why I get this error when trying df = pd.read_csv(file):
ParserError: Error tokenizing data. C error: Expected 14 fields in line 7, saw 30
When pandas reads in the CSV, it sees 14 fields in the first row and, based on that first row, uses them as the headers (which is what I want).
However, later rows extend further (specifically from row 7 onward).
I can find solutions that read the file by skipping rows 1-6, but I don't want that. I still want the whole CSV to be read; instead of the header being 14 columns, how can I tell pandas to make the header 30 columns, and where there is no text, just leave the column as "", null, or some arbitrary name? In other words, I don't care what the extra columns are named, I just need the placeholders so it can parse past row 6.
I'm wondering whether there is a way to read in the CSV and explicitly say there are 30 columns, but I haven't found a solution.
Here are a couple of solutions that I think should work.
1) Set header=None and give column names via the names parameter of read_csv.
df = pd.read_csv(file, header=None, names=['field1', 'field2', ..., 'field30'])
PS: this will work if your CSV doesn't have a header already.
2) Alternatively, you can try the command below (if your CSV already has a header row):
df = pd.read_csv(file, usecols=[0, 1, 2, ..., 30])
Let me know if this works out for you.
Thanks,
Rohan Hodarkar
What about trying the following? Note that error_bad_lines=False causes the offending lines to be skipped (in newer pandas versions this parameter has been replaced by on_bad_lines='skip'):
data = pd.read_csv('File_path', error_bad_lines=False)
Just a few more collected answers:
It might be an issue with the delimiters in your data's first row.
To solve it, try specifying the sep and/or header arguments when calling read_csv. For instance:
df = pandas.read_csv('File_path', sep='delimiter', header=None)
In the code above, sep defines your delimiter and header=None tells pandas that your source data has no row for headers / column titles. The documentation says: "If file contains no header row, then you should explicitly pass header=None". In this instance, pandas automatically creates whole-number column names {0, 1, 2, ...}.
According to the docs, the delimiter thing should not be an issue. The docs say that "if sep is None [not specified], will try to automatically determine this." I however have not had good luck with this, including instances with obvious delimiters.
This might be a delimiter issue: many CSV-like files are actually tab-separated, so try read_csv with the tab character (\t) as the separator. Try opening the file with the following line:
data=pd.read_csv("File_path", sep='\t')
OR
pandas.read_csv('File_path', header=None, sep=', ')
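One way to combine these suggestions for the original question: pass more column names than the widest row has fields, so ragged lines stop breaking the tokenizer (a sketch on an in-memory file; adjust the count and separator to your data):

```python
from io import StringIO
import pandas as pd

# Hypothetical ragged CSV: the first row has 2 fields, a later one has 4
raw = "a,b\n1,2,3,4\n"

# Supplying 4 names up front reserves space for the widest row;
# shorter rows are padded with NaN instead of raising ParserError
df = pd.read_csv(StringIO(raw), header=None, names=range(4))
print(df.shape)  # (2, 4)
```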
