Store multi-index pandas dataframe with hdf5 table format - python

I just came across this issue when adding a multi-index to my pandas dataframe. I am using the pandas HDFStore with the option format='table', which I prefer because the saved data frame is easier to understand and load when not using pandas. (For details, see this SO answer: Save pandas DataFrame using h5py for interoperability with other hdf5 readers.)
But I ran into a problem because I was setting the multi-index with drop=False when calling set_index, which keeps the index columns as dataframe columns. This was fine until I put the dataframe into the store using format='table'. Using format='fixed' worked fine, but format='table' gave me an error about duplicate column names. I avoided the error by dropping the redundant columns before putting and restoring them after getting.
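For context, a minimal sketch that should reproduce the duplicate-column problem described above (the file name, keys, and column names are made up for illustration):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'x': [5.0, 6.0]})
# drop=False keeps 'a' and 'b' both in the index and as regular columns
df = df.set_index(['a', 'b'], drop=False)

with pd.HDFStore('demo.h5') as store:  # hypothetical file name
    store.put('works', df, format='fixed')  # fine
    store.put('fails', df, format='table')  # raises the duplicate-column error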
Here is the write/read pair of functions that I now use:
def write_df_without_index_columns(store, name, df):
    if isinstance(df.index, pd.MultiIndex):
        # drop any columns that are duplicates of index columns
        redundant_columns = set(df.index.names).intersection(set(df.columns))
        if redundant_columns:
            df = df.copy(deep=True)
            df.drop(list(redundant_columns), axis=1, inplace=True)
    store.put(name, df,
              format='table',
              data_columns=True)
def read_df_add_index_columns(store, name, default_value):
    df = store.get(name)
    if isinstance(df.index, pd.MultiIndex):
        # remember the MultiIndex column names
        index_columns = df.index.names
        # put the MultiIndex columns into the data frame
        df.reset_index(drop=False, inplace=True)
        # now put the MultiIndex columns back into the index
        df.set_index(index_columns, drop=False, inplace=True)
    return df
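For reference, a round trip with these helpers (reusing the df from the sketch above, with a hypothetical store path and key) might look like:
with pd.HDFStore('demo.h5') as store:
    write_df_without_index_columns(store, 'my_frame', df)
    restored = read_df_add_index_columns(store, 'my_frame', default_value=None)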
My question: is there a better way to do this? I expect to have a data frame with millions of rows, so I do not want this to be too inefficient.

Related

Multiindex Dataframe, Pandas

I am trying to manipulate data from an Excel file; however, it has merged headings for the columns. I managed to transform them in pandas. Please see the example of the original data below.
So I transformed it to this format.
My final goal is to get the format below and plot brand items and their sales quantities and prices over the period; however, I don't know how to access info in a multi-index dataframe. Could you please suggest something? Thanks.
My code:
import pandas as pd

df = pd.read_excel('path.xls', sheet_name='data', header=[0, 1])
# blank out the 'Unnamed: ...' placeholders pandas generates
# for merged header cells in the top header level
a = df.columns.get_level_values(0).to_series()
b = a.mask(a.str.startswith('Unnamed')).fillna('')
df.columns = [b, df.columns.get_level_values(1)]
df.drop(0, inplace=True)  # drop the leftover blank first row
Try pandas groupby or pivot_table. pivot_table takes index, columns, values, and aggfunc arguments. It is really nice for summarizing data.
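For example, a minimal pivot_table sketch on long-format sales data (the column names 'brand', 'period', and 'quantity' are made up for illustration):
import pandas as pd

# made-up long-format sales data
df = pd.DataFrame({
    'brand': ['A', 'A', 'B', 'B'],
    'period': ['2020', '2021', '2020', '2021'],
    'quantity': [10, 15, 7, 9],
})

# one row per brand, one column per period, summing quantities
summary = df.pivot_table(index='brand', columns='period',
                         values='quantity', aggfunc='sum')
print(summary)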

Pandas data frame not allowing me to drop first empty column in python?

I have read in some data from a csv, and there were a load of spare columns and rows that were not needed. I've managed to get rid of most of them, but the first column is showing as NaN and will not drop despite several attempts. This means I cannot promote the titles in row 0 to headers. I have tried the below:
df = pd.read_csv("List of schools.csv")
df = df.iloc[3:]
df.dropna(how='all', axis=1, inplace=True)
df.head()
But I am still getting this returned:
Any help, please? I'm a newbie.
You can improve your read_csv() operation.
As Avloss notes, you can tell those "columns" are actually the index because they are rendered in bold. Looking at your output, there are two things of note.
The "columns" are bold, implying that pandas read them in as part of the DataFrame's index rather than as values.
There is no information above the horizontal line at the top, indicating there are currently no column names. The top row of the csv file that contains the column names is being read in as values.
To solve your column deletion problem, you should first improve your read_csv() operation by being more explicit. Your current code is placing column headers in the data and placing some of the data in the index. Since you have the operation df = df.iloc[3:] in your code, I'm assuming the data in your csv file doesn't start until the 4th row. Try this:
header_row = 3  # or 4 - I am bad at zero-indexing
df = pd.read_csv('List of schools.csv', header=header_row, index_col=False)
df.dropna(how='all', axis=1, inplace=True)
This code should read the column names in as column names and not index any of the columns, giving you a cleaner DataFrame to work from when dropping NA values.
Those aren't columns; they are the index. You can convert them to columns by doing:
df = df.reset_index()

Drop column using Dask dataframe

This should work:
raw_data.drop('some_great_column', axis=1).compute()
But the column is not dropped. In pandas I use:
raw_data.drop(['some_great_column'], axis=1, inplace=True)
But inplace does not exist in Dask. Any ideas?
You can separate into two operations:
# dask operation
raw_data = raw_data.drop('some_great_column', axis=1)
# conversion to pandas
df = raw_data.compute()
Then export the Pandas dataframe to a CSV file:
df.to_csv(r'out.csv', index=False)
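For what it's worth, both steps can also stay entirely in Dask if the data is large; a rough sketch, where the input and output file patterns are hypothetical:
import dask.dataframe as dd

raw_data = dd.read_csv('input-*.csv')  # hypothetical input files
result = raw_data.drop('some_great_column', axis=1)
result.to_csv('out-*.csv', index=False)  # writes one file per partition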
I assume you want to keep "raw data" in a Dask DF. In that case the following will do the trick:
new_raw_df = raw_data.drop('some_great_column', axis=1).copy()
where type(new_raw_df) is dask.dataframe.core.DataFrame and you can delete the original DF.

Pandas Data Frame saving into csv file

I am wondering how to save a new pandas Series into a csv file as a separate column. Suppose I have two csv files which both contain a column 'A'. I apply some mathematical function to them and then create a new variable 'B'.
For example:
import pandas as pd

data = pd.read_csv('filepath')
data['B'] = data['A'] * 10
# then append the new column to a list: B_list.append(data.B)
This continues until all of the rows of the first and second csv files have been read.
I would like to save column B from both csv files into a new spreadsheet.
For example I need this result:
column1 (from csv1)    column2 (from csv2)
data.B.value           data.B.value
By using this code:
pd.DataFrame(np.array(B_list)).T.to_csv('file.csv', index=False, header=None)
I do not get my preferred result.
Each column in a pandas DataFrame is a pandas Series, so your B_list is actually a list of pandas Series, which you can pass to the DataFrame() constructor and then transpose (or, as @jezrael shows, merge horizontally with pd.concat(..., axis=1)):
finaldf = pd.DataFrame(B_list).T
finaldf.to_csv('output.csv', index=False, header=None)
And should the csv files have different numbers of rows, the unequal Series are filled with NaNs at the corresponding rows.
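A tiny illustration of that padding behavior (made-up values):
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5])  # one row shorter

out = pd.DataFrame([s1, s2]).T  # build from the list of Series, then transpose
print(out)
#      0    1
# 0  1.0  4.0
# 1  2.0  5.0
# 2  3.0  NaN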
I think you need to concat the column from data1 with the column from data2 first:
df = pd.concat(B_list, axis=1)
df.to_csv('file.csv', index=False, header=None)

Setting a pandas index or transposing

I imported a table with 30 columns of data and pandas automatically generated an index for the rows from 0-232. I went to make a new dataframe with only 5 of the columns, using the below code:
df = pd.DataFrame(data=[data['Age'], data['FG'], data['FGA'], data['3P'], data['3PA']])
When I viewed the df, the rows and columns had been transposed, so that the index made 232 columns and there were 5 rows. How can I set the index vertically, or transpose the dataframe?
(Passing a list of Series to the DataFrame constructor treats each Series as a row, which is why your result came out transposed.) The correct approach is actually much simpler: you just need to pull out the columns simultaneously with a list of column names:
df = data[['Age', 'FG', 'FGA', '3P', '3PA']]
Paul's response is the preferred way to perform this operation. But as you suggest, you could alternatively transpose the DataFrame after reading it in:
df = df.T
