I have a dataframe grouped by 3 variables. It looks like:
https://i.stack.imgur.com/q8W0y.png
When I export the table to CSV, the format changes. I want to keep the original format.
Any ideas?
Thanks!
Pandas to_csv (and CSV in general) does not support the MultiIndex used in your data. Instead, it writes the index out "long": each level of the MultiIndex becomes a column, and each row carries its index values. I suspect that is what you are calling "format changes".
The upshot is that if you want to save a pandas dataframe to CSV and then re-establish the dataframe from the CSV, you need to rebuild the MultiIndex yourself after importing it.
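For example, a minimal sketch of that round trip (the grouping column names 'a', 'b' and 'c' are placeholders for your actual variables):
import pandas as pd

# Placeholder grouping columns 'a', 'b', 'c'; substitute your own.
grouped = df.groupby(['a', 'b', 'c']).sum()

# to_csv flattens the MultiIndex: each level becomes an ordinary column.
grouped.to_csv('grouped.csv')

# On re-import, rebuild the MultiIndex by telling read_csv which columns
# form the index (or call .set_index(['a', 'b', 'c']) afterwards).
restored = pd.read_csv('grouped.csv', index_col=['a', 'b', 'c'])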
I have a few small data frames that I'm outputting to Excel on one sheet. To make them fit better, I need to merge some cells in one table, but to write this in XlsxWriter, I need to specify the data parameter. I want to keep the data that was already written to the left cell by the to_excel() bit of code. Is there a way to do this without having to specify the data parameter? Or do I need to look up the value in the dataframe to put in there?
For example:
df.to_excel(writer, 'sheet') gives output similar to the following:
Then I want to merge across C:D for this table without having to specify what data should be there (because it is already in column C), using something like:
worksheet.merge_range('C1:D1', cell_format = fmat) etc.
to get below:
Is this possible? Or will I need to lookup the values in the dataframe?
You will need to look up the data from the dataframe. There is no way in XlsxWriter to write formatting on top of existing data: the data and the formatting need to be written at the same time (apart from conditional formatting, which can't be used for merging anyway).
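For example, a rough sketch of that lookup (the sheet name, cell addresses and dataframe position are assumptions about your layout):
# After df.to_excel(writer, 'sheet'), fetch the worksheet object...
worksheet = writer.sheets['sheet']

# ...look the value up in the dataframe (the position is an assumption),
value = df.iloc[0, 2]

# ...and pass it to merge_range together with the format, so the merged
# cell is re-written with both the data and the formatting.
worksheet.merge_range('C1:D1', value, fmat)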
I am just now diving into this wonderful library and am pretty baffled by how filtering, or even column manipulation, is done, and I am trying to understand whether this is a feature of pandas or of Python itself. More precisely:
import pandas
df = pandas.read_csv('data.csv')
# Doing
df['Column'] # would display all values in the column 'Column'
# Even more so, doing
df.loc[df['Column'] > 10] # would display all rows where 'Column' is greater than 10
# and is the same as
df.loc[df.Column > 10]
So columns are both attributes and keys, so a DataFrame is both a dict and an object? Or perhaps I am missing some basic Python functionality that I don't know about... And does accessing a column basically loop over the whole dataset? How is this achieved?
Column filtering, column manipulation, and data manipulation in general are features of the pandas library itself. Once you load your data using pd.read_csv, the data set is stored as a pandas DataFrame, a dictionary-like container, and every column of the DataFrame is a pandas Series object. You can access a column either as an attribute (df.columnname) or as a key (df['columnname']), and methods like .head(), .tail(), .shape or .isna() work the same way regardless of which style you use. When you access a column, pandas looks the name up among the DataFrame's column labels rather than scanning the whole data set; if it matches, the column is returned, otherwise you get a KeyError or an AttributeError, depending on how you accessed it.
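To see that supporting both access styles is ordinary Python machinery rather than a loop over the data, here is a toy sketch (not pandas' actual implementation) of a class whose columns can be reached both as keys and as attributes:
# Toy illustration only; the real pandas DataFrame is far more sophisticated.
class TinyFrame:
    def __init__(self, columns):
        # dict mapping column name -> list of values
        self._columns = columns

    def __getitem__(self, name):
        # enables frame['Column']; raises KeyError if the name is unknown
        return self._columns[name]

    def __getattr__(self, name):
        # called only when normal attribute lookup fails; enables frame.Column
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(name)

frame = TinyFrame({'Column': [5, 12, 20]})
print(frame['Column'])   # [5, 12, 20]
print(frame.Column)      # [5, 12, 20]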
I am using pandas and xlsxwriter to export and format a number of dataframes to Excel.
The xlsxwriter documentation mentions that:
http://xlsxwriter.readthedocs.io/worksheet.html?highlight=set_column
A row format takes precedence over a default column format
Precedence means that if you format column B as percentage and then row 2 as bold, cell B2 won't be both bold and in %: it will be bold only, not in %!
I have provided an example below. Is there a way around it? Maybe an engine other than xlsxwriter? Maybe some way to apply formatting after exporting the dataframes to Excel?
It makes no difference whether I format the row first and the columns later, or vice versa.
It's not shown in the example below, but in my code I export a number of dataframes, all with the same columns, to the same Excel sheet. The dataframes are the equivalent of an Excel Pivot table, with a 'total' row at the bottom. I'd like the header row and the total row to be bold, and each column to have a specific formatting depending on the data (%, thousands, millions, etc). Sample code below:
import pandas as pd

# Use the xlsxwriter engine so writer.book is an xlsxwriter Workbook.
writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
wk = writer.book.add_worksheet('Test')

fmt_bold = writer.book.add_format({'bold': True})
fmt_pct = writer.book.add_format({'num_format': '0.0%'})

wk.write(1, 1, 1)                   # value 1 in cell B2
wk.write(2, 1, 2)                   # value 2 in cell B3
wk.set_column(1, 1, None, fmt_pct)  # column B as percentage
wk.set_row(1, None, fmt_bold)       # row 2 as bold: B2 loses the % format

writer.close()
As @jmcnamara notes, openpyxl offers different formatting options because it essentially lets you process the dataframe cell by cell within the worksheet. NB: openpyxl does not support row or column formats.
The openpyxl dataframe_to_rows() function converts a dataframe into a generator of values, row by row, allowing you to apply whatever formatting or additional processing you like.
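For example, a rough sketch of that cell-by-cell approach with openpyxl (the sample dataframe, the sheet layout and the choice of number format are assumptions for illustration):
from openpyxl import Workbook
from openpyxl.styles import Font
from openpyxl.utils.dataframe import dataframe_to_rows
import pandas as pd

# Hypothetical sample data: a 'share' column to be shown as a percentage.
df = pd.DataFrame({'name': ['a', 'b', 'total'], 'share': [0.25, 0.55, 0.80]})

wb = Workbook()
ws = wb.active

# Write the dataframe row by row.
for row in dataframe_to_rows(df, index=False, header=True):
    ws.append(row)

# Per-cell number format for column B (the 'share' column), skipping the header.
for row_cells in ws.iter_rows(min_row=2, min_col=2, max_col=2):
    for cell in row_cells:
        cell.number_format = '0.0%'

# Bold the header row and the last ('total') row.
for cell in ws[1]:
    cell.font = Font(bold=True)
for cell in ws[ws.max_row]:
    cell.font = Font(bold=True)

wb.save('test_openpyxl.xlsx')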
In this case you will need to create another format that is a combination of the row and column formats and apply it to the cell.
In order to do that you will need to iterate over the data frame and call XlsxWriter directly, rather than using the pandas Excel interface.
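For instance, a minimal sketch of the combined-format idea, reusing the writer, worksheet and formats from the question's sample code (the exact properties to combine are up to you):
# One format object carrying both the row (bold) and column (%) properties.
fmt_bold_pct = writer.book.add_format({'bold': True, 'num_format': '0.0%'})

# Keep the column/row formats for the cells you don't touch explicitly...
wk.set_column(1, 1, None, fmt_pct)
wk.set_row(1, None, fmt_bold)

# ...and write the overlapping cell directly with the combined format,
# which takes precedence over both the row and the column format.
wk.write(1, 1, 1, fmt_bold_pct)   # B2: bold and percentage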
Alternatively, you may be able to do this using OpenPyXL as the pandas Excel engine. Recent versions of the pandas interface added the ability to apply formatting to the Excel data after the dataframe has been written, when using OpenPyXL.
I imported a CSV as a dataframe from the San Francisco Salaries dataset on Kaggle:
df=pd.read_csv('Salaries.csv')
I created a dataframe as an aggregate function from 'df'
df2=df.groupby(['JobTitle','Year'])[['TotalPay']].median()
Problem 1: The first and second columns appear nameless, and that shouldn't happen.
Even when I use
df2.columns
it only lists TotalPay as a column.
Problem 2: I try to rename, for instance, the first column to JobTitle, and the code doesn't do anything:
df3=df2.rename(columns = {0:'JobTitle'},inplace=True)
So the solution that was given here apparently does not work: Rename unnamed column pandas dataframe.
I am hoping for one of two possible solutions:
1) That the aggregate function respects the column naming, AND/OR
2) A way to rename the unnamed columns of the dataframe.
The problem isn't really that you need to rename the columns.
What do the first few rows of the .csv file you're importing look like? It sounds as though you're not importing it properly: pandas isn't recognising that JobTitle and Year are meant to be column headers. pandas read_csv() is very flexible in what it will let you do.
If you import the data properly, you won't need to reindex or relabel.
Quoting the answer by MaxU:
df3 = df2.reset_index()
Thank you!
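To spell out why this works (building only on the code from the question): the groupby moved JobTitle and Year into the MultiIndex, so they are not columns at all, and reset_index moves them back out as ordinary columns. Roughly:
df2 = df.groupby(['JobTitle', 'Year'])[['TotalPay']].median()
# df2.columns -> ['TotalPay'] only; JobTitle and Year live in the index

df3 = df2.reset_index()
# df3.columns -> ['JobTitle', 'Year', 'TotalPay']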
Specifically,
If I don't need to change the datatype, is it better left alone? Does it copy the whole column of a dataframe? Does it copy the whole dataframe? Or does it just alter some setting in the dataframe to treat the entries in that column as a particular type?
Also, is there a way to set the type of the columns while the dataframe is getting created?
Here is one example: "2014-05-25 12:14:01.929000" is cast as np.datetime64 when the dataframe is created. Then I save the dataframe to a CSV. Then I read from the CSV, and the column comes back as a generic object. How would I avoid this? Or how can I re-cast this particular column as np.datetime64 while doing pd.read_csv ....
Thanks.
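A rough sketch of that round trip, with a placeholder column name 'ts' (substitute your actual column): parse_dates in read_csv, or pd.to_datetime afterwards, restores the datetime64 dtype.
import pandas as pd

# Placeholder column name 'ts'.
df = pd.DataFrame({'ts': ['2014-05-25 12:14:01.929000']})
df['ts'] = pd.to_datetime(df['ts'])           # dtype: datetime64[ns]
df.to_csv('out.csv', index=False)

# Reading back naively gives a generic object column...
df_back = pd.read_csv('out.csv')
print(df_back['ts'].dtype)                    # object

# ...so either parse the column while reading,
df_parsed = pd.read_csv('out.csv', parse_dates=['ts'])
print(df_parsed['ts'].dtype)                  # datetime64[ns]

# ...or re-cast it after the fact.
df_back['ts'] = pd.to_datetime(df_back['ts'])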