Pandas, I get dataframe full of nan when reading from xlsx - python

I am reading from an Excel file ".xslx", it's consist of 3 columns, but when I read from it, I get a DF full of nans, I checked the table in Excel, it consists of normal cells no formulas no hyperlinks.
My code:
data = pd.read_excel("Data.xlsx")
df = pd.DataFrame(data, columns=["subreddit_group", "links/caption", "subreddits/flair"])
print(df)
Here is the excel file:
Here is the output:

The column parameter of pd.Dataframe() function doesn't set column names in result dataframe, but selects columns from the original file.
See pandas documentation :
Column labels to use for resulting frame when data does not have them, defaulting to RangeIndex(0, 1, 2, …, n). If data contains column labels, will perform column selection instead.
So you shouldn't provide column parameter and after the file is read, rename columns of the dataframe:
df = pd.DataFrame(data)
df.columns = ['subreddit_group', 'links/caption', 'def']

Related

How to unmerge multiple cells and transpose each value into a new column in Pandas dataframe from Excel file

This has 4 to 5 merged cells, blank cells. I need to get them in a format where a column is created for each merged cell.
Please find the link for Sample file and also the output required below.
Link : https://docs.google.com/spreadsheets/d/1fkG9YT9YGg9eXYz6yTkl7AMeuXwmNYQ44oXgA8ByMPk/edit?usp=sharing
Code so far:
dfs = pd.read_excel('/Users/john/Desktop/Internal/Raw Files/Med/TVS/sample.xlsx',sheet_name=None, header=[0,4], index_col=[0,1])
df_2 = (pd.concat([df.stack(0).assign(name=n) for n,df in dfs.items()])
.rename_axis(index=['Date','WK','Brand','A_B','Foo_Foos','Cur','Units'], columns=None)
.reset_index())
How do I create multiple column in the above code? Right now I'm getting the below error:
ValueError: Length of names must match number of levels in MultiIndex.
Try this:
df = pd.read_excel("Sample_File.xlsx", header=[0,1,2,3,4,5], index_col = [0,1])
df.stack(level=[0,1,2,3,4])

Reset labels in Pandas DataFrame, Python

I have a csv file with a wrong first row data. The names of labels are in the row number 2. So when I am storing this file to the DataFrame the names of labels are incorrect. And correct names become values of the row 0. Is there any function similar to reset_index() but for columns? PS I can not change csv file. Here is an image for better understanding. DataFrame with wrong labels
Hello let's suppose you csv file is data.csv :
Try this code:
import pandas as pd
#reading the csv file
df = pd.read_csv('data.csv')
#changing the headers name to integers
df.columns = range(df.shape[1])
#saving the data in another csv file
df.to_csv('data_without_header.csv',header=None,index=False)
#reading the new csv file
new_df = pd.read_csv('data_without_header.csv')
#plotting the new data
new_df.head()
If you do not care about the rows preceding your column names, you can pass in the "header" argument with the value of the correct row, for example if the proper column names are in row 2:
df = pd.read_csv('my_csv.csv', header=2)
Keep in mind that this will erase the previous rows from the DataFrame. If you still want to keep them, you can do the following thing:
df = pd.read_csv('my_csv.csv')
df.columns = df.iloc[2, :] # replace columns with values in row 2
Cheers.

Pandas dataframe read_excel does not consider blank upper left cells as columns?

I'm trying to read an Excel or CSV file into pandas dataframe. The file will read the first two columns only, and the top row of the first two columns will be the column names. The problem is when I have the first column of the top row empty in the Excel file.
IDs
2/26/2010 2
3/31/2010 4
4/31/2010 2
5/31/2010 2
Then, the last line of the following code fails:
uploaded_file = request.FILES['file-name']
if uploaded_file.name.endswith('.csv'):
df = pd.read_csv(uploaded_file, usecols=[0,1])
else:
df = pd.read_excel(uploaded_file, usecols=[0,1])
ref_date = 'ref_date'
regime_tag = 'regime_tag'
df.columns = [ref_date, regime_tag]
Apparently, it only reads one column (i.e. the IDs). However, with read_csv, it reads both column, with the first column being unnamed. I want it to behave that way and read both columns regardless of whether the top cells are empty or filled. How do I go about doing that?
What's happening is the first "column" in the Excel file is being read in as an index, while in the CSV file it's being treated as a column / series.
I recommend you work the other way and amend pd.read_csv to read the first column as an index. Then use reset_index to elevate the index to a series:
if uploaded_file.name.endswith('.csv'):
df = pd.read_csv(uploaded_file, usecols=[0,1], index_col=0)
else:
df = pd.read_excel(uploaded_file, header=[0,1], usecols=[0,1])
df = df.reset_index() # this will elevate index to a column called 'index'
This will give consistent output, i.e. first series will have label 'index' and the index of the dataframe will be the regular pd.RangeIndex.
You could potentially use a dispatcher to get rid of the unwieldy if / else construct:
file_flag = {True: pd.read_csv, False: pd.read_excel}
read_func = file_flag[uploaded_file.name.endswith('.csv')]
df = read_func(uploaded_file, usecols=[0,1], index_col=0).reset_index()

Pandas Data Frame saving into csv file

I wonder how to save a new pandas Series into a csv file in a different column. Suppose I have two csv files which both contains a column as a 'A'. I have done some mathematical function on them and then create a new variable as a 'B'.
For example:
data = pd.read_csv('filepath')
data['B'] = data['A']*10
# and add the value of data.B into a list as a B_list.append(data.B)
This will continue until all of the rows of the first and second csv file has been reading.
I would like to save a column B in a new spread sheet from both csv files.
For example I need this result:
colum1(from csv1) colum2(from csv2)
data.B.value data.b.value
By using this code:
pd.DataFrame(np.array(B_list)).T.to_csv('file.csv', index=False, header=None)
I won't get my preferred result.
Since each column in a pandas DataFrame is a pandas Series. Your B_list is actually a list of pandas Series which you can cast to DataFrame() constructor, then transpose (or as #jezrael shows a horizontal merge with pd.concat(..., axis=1))
finaldf = pd.DataFrame(B_list).T
finaldf.to_csv('output.csv', index=False, header=None)
And should csv have different rows, unequal series are filled with NANs at corresponding rows.
I think you need concat column from data1 with column from data2 first:
df = pd.concat(B_list, axis=1)
df.to_csv('file.csv', index=False, header=None)

How to delete or drop the column labelled "index" from a dataframe when using to_csv() to save as csv

I am reading a csv file, cleaning it up a little, and then saving it back to a new csv file. The problem is that the new csv file has a new column (first column in fact), labelled as index. Now this is not the row index, as I have turned that off in the to_csv() function as you can see in the code. Plus row index doesn't have a column label as well.
df = pd.read_csv('D1.csv', na_values=0, nrows = 139) # Read csv, with 0 values converted to NaN
df = df.dropna(axis=0, how='any') # Delete any rows containing NaN
df = df.reset_index()
df.to_csv('D1Clean.csv', index=False)
Any ideas where this phantom column is coming from and how to get rid of it?
I think you need add parameter drop=True to reset_index:
df = df.reset_index(drop=True)
drop : boolean, default False
Do not try to insert index into dataframe columns. This resets the index to the default integer index.

Categories

Resources