MultiIndex DataFrame, Pandas - Python

I am trying to manipulate data from an Excel file, but it has merged headings for the columns. I managed to transform them in pandas; please see the example of the original data below.
So I transformed it to this format.
My final goal is to get the format below and plot the brand items with their sales quantities and prices over the period, but I don't know how to access the information in a MultiIndex DataFrame. Could you please suggest something? Thanks.
My code:
import pandas as pd

# read the sheet with a two-row (merged) header into a column MultiIndex
df = pd.read_excel('path.xls', sheet_name='data', header=[0, 1])
# level 0 contains 'Unnamed: ...' placeholders where cells were merged
a = df.columns.get_level_values(0).to_series()
b = a.mask(a.str.startswith('Unnamed')).fillna('')
# rebuild the columns from the cleaned level 0 and the original level 1
df.columns = [b, df.columns.get_level_values(1)]
df.drop(0, inplace=True)

Try pandas groupby or pivot_table. A pivot table takes index, columns, values, and aggfunc arguments, and it is really nice for summarizing data.
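To access data in a MultiIndex DataFrame you can index a column with a (level 0, level 1) tuple, or take a cross-section with xs. A minimal sketch, assuming made-up headings such as 'Brand A'/'Brand B' with 'Sales' and 'Price' sub-columns (adjust to your real ones):

import pandas as pd

# toy frame with the same shape as the transformed data: two-level columns
columns = pd.MultiIndex.from_tuples([
    ('Brand A', 'Sales'), ('Brand A', 'Price'),
    ('Brand B', 'Sales'), ('Brand B', 'Price'),
])
df = pd.DataFrame([[10, 1.5, 20, 2.0],
                   [12, 1.6, 18, 2.1]],
                  index=pd.to_datetime(['2021-01-01', '2021-02-01']),  # assumed period index
                  columns=columns)

# a single column is addressed with a (level 0, level 1) tuple
sales_a = df[('Brand A', 'Sales')]

# all 'Sales' columns across brands via a cross-section on level 1
all_sales = df.xs('Sales', axis=1, level=1)

# plot the quantities over the period
all_sales.plot()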

Related

Pandas: I get a DataFrame full of NaN when reading from xlsx

I am reading from an Excel file (".xlsx") that consists of 3 columns, but when I read it I get a DataFrame full of NaNs. I checked the table in Excel; it consists of normal cells, with no formulas and no hyperlinks.
My code:
data = pd.read_excel("Data.xlsx")
df = pd.DataFrame(data, columns=["subreddit_group", "links/caption", "subreddits/flair"])
print(df)
Here is the excel file:
Here is the output:
The columns parameter of the pd.DataFrame() constructor doesn't set column names in the resulting DataFrame; it selects columns from the original data.
See the pandas documentation:
Column labels to use for resulting frame when data does not have them, defaulting to RangeIndex(0, 1, 2, …, n). If data contains column labels, will perform column selection instead.
So you shouldn't pass the columns parameter; instead, rename the columns of the DataFrame after the file is read:
df = pd.DataFrame(data)
df.columns = ['subreddit_group', 'links/caption', 'def']
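A minimal sketch of the difference, using a made-up frame instead of the asker's file: passing columns to the constructor selects matching labels (a label that doesn't exist comes back as all NaN), while assigning df.columns afterwards only relabels:

import pandas as pd

data = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# selection: 'c' does not exist in data, so that column is all NaN
selected = pd.DataFrame(data, columns=['a', 'c'])

# renaming: keep everything that was read, then relabel
renamed = pd.DataFrame(data)
renamed.columns = ['first', 'second']

print(selected)
print(renamed)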

Python: Create dataframe with 'uneven' column entries

I am trying to create a dataframe where the column lengths are not equal. How can I do this?
I was trying to use groupby, but I don't think this is the right way.
import pandas as pd
data = {'filename':['file1','file1'], 'variables':['a','b']}
df = pd.DataFrame(data)
grouped = df.groupby('filename')
print(grouped.get_group('file1'))
Above is my sample code, the output of which is:
What can I do to just have one entry of 'file1' under 'filename'?
Eventually I need to write this to a csv file.
Thank you
If you only have one entry in a column, the other rows will be NaN, so you could just filter the NaNs out with something like df = df[df["filename"].notnull()] (note that .at is for scalar access and won't take a boolean mask).
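A minimal sketch of that filter, with a toy frame in the shape the answer describes and a hypothetical output file name:

import pandas as pd

# rows belonging to the same file leave a gap (NaN) in the shorter column
df = pd.DataFrame({'filename': ['file1', None], 'variables': ['a', 'b']})

# keep only the rows whose 'filename' is present
filtered = df[df['filename'].notnull()]

# the asker eventually wants this in a CSV file
filtered.to_csv('file1_variables.csv', index=False)  # placeholder file name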

Pandas Data Frame saving into csv file

I am wondering how to save a new pandas Series into a CSV file as a separate column. Suppose I have two CSV files that both contain a column 'A'. I apply a mathematical function to it and then create a new variable 'B'.
For example:
data = pd.read_csv('filepath')
data['B'] = data['A']*10
# and append data.B to a list: B_list.append(data.B)
This continues until all of the rows of the first and second CSV files have been read.
I would like to save column B from both CSV files in a new spreadsheet.
For example, I need this result:
column1 (from csv1)   column2 (from csv2)
data.B values         data.B values
By using this code:
pd.DataFrame(np.array(B_list)).T.to_csv('file.csv', index=False, header=None)
I won't get my preferred result.
Each column in a pandas DataFrame is a pandas Series, so your B_list is actually a list of pandas Series, which you can pass to the DataFrame() constructor and then transpose (or, as @jezrael shows, merge horizontally with pd.concat(..., axis=1)):
finaldf = pd.DataFrame(B_list).T
finaldf.to_csv('output.csv', index=False, header=None)
And should the CSV files have different numbers of rows, the shorter Series are padded with NaNs at the corresponding rows.
I think you need to concat the column from data1 with the column from data2 first:
df = pd.concat(B_list, axis=1)
df.to_csv('file.csv', index=False, header=None)
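Putting the pieces together, a minimal end-to-end sketch assuming two hypothetical files csv1.csv and csv2.csv that each contain a column 'A':

import pandas as pd

B_list = []
for path in ['csv1.csv', 'csv2.csv']:  # hypothetical file names
    data = pd.read_csv(path)
    data['B'] = data['A'] * 10         # the calculation from the question
    B_list.append(data['B'])

# each Series becomes one column; a shorter file just leaves NaNs in its column
df = pd.concat(B_list, axis=1)
df.to_csv('file.csv', index=False, header=False)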

Store multi-index pandas dataframe with hdf5 table format

I just came across this issue when adding a multi-index to my pandas dataframe. I am using the pandas HDFStore with the option format='table', which I prefer because the saved data frame is easier to understand and load when not using pandas. (For details see this SO answer: Save pandas DataFrame using h5py for interoperabilty with other hdf5 readers.)
But I ran into a problem because I was setting the multi-index using drop=False when calling set_index, which keeps the index columns as dataframe columns. This was fine until I put the dataframe to the store using format='table'. Using format='fixed' worked fine. But format='table' gave me an error with duplicate column names. I avoided the error by dropping the redundant columns before putting and restoring the columns after getting.
Here is the write/read pair of functions that I now use:
def write_df_without_index_columns(store, name, df):
    if isinstance(df.index, pd.MultiIndex):
        # drop any columns that are duplicates of index columns
        redundant_columns = set(df.index.names).intersection(set(df.columns))
        if redundant_columns:
            df = df.copy(deep=True)
            df.drop(list(redundant_columns), axis=1, inplace=True)
    store.put(name, df,
              format='table',
              data_columns=True)

def read_df_add_index_columns(store, name, default_value):
    df = store.get(name)
    if isinstance(df.index, pd.MultiIndex):
        # remember the MultiIndex column names
        index_columns = df.index.names
        # put the MultiIndex columns into the data frame
        df.reset_index(drop=False, inplace=True)
        # now put the MultiIndex columns back into the index
        df.set_index(index_columns, drop=False, inplace=True)
    return df
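For context, a minimal usage sketch of the pair above, with a made-up two-level index ('date', 'brand') and a placeholder file name:

import pandas as pd

df = pd.DataFrame({'date': ['2021-01', '2021-01'],
                   'brand': ['A', 'B'],
                   'sales': [10, 20]})
# drop=False keeps 'date' and 'brand' as columns, which is what triggers
# the duplicate-column error with format='table'
df.set_index(['date', 'brand'], drop=False, inplace=True)

with pd.HDFStore('example.h5') as store:  # placeholder path
    write_df_without_index_columns(store, 'sales', df)
    restored = read_df_add_index_columns(store, 'sales', default_value=None)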
My question: is there a better way to do this? I expect to have a data frame with millions of rows, so I do not want this to be too inefficient.

Setting a pandas index or transposing

I imported a table with 30 columns of data, and pandas automatically generated an index for the rows from 0-232. I went to make a new DataFrame with only 5 of the columns, using the code below:
df = pd.DataFrame(data=[data['Age'], data['FG'], data['FGA'], data['3P'], data['3PA']])
When I viewed the df the rows and columns had been transposed, so that the index made 232 columns and there were 5 rows. How can I set the index vertically, or transpose the dataframe?
The correct approach is actually much simpler. You just need to pull out the columns simultaneously with a list of column names:
df = data[['Age', 'FG', 'FGA', '3P', '3PA']]
Paul's response is the preferred way to perform this operation, but as you suggest, you could alternatively transpose the DataFrame after building it:
df = df.T
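A minimal sketch of both options on a toy frame, using the column names from the question:

import pandas as pd

# stand-in for the imported table
data = pd.DataFrame({'Age': [22, 25], 'FG': [5, 7], 'FGA': [10, 12],
                     '3P': [1, 2], '3PA': [3, 4], 'Other': [0, 0]})

# preferred: select the columns directly; rows stay as rows
df = data[['Age', 'FG', 'FGA', '3P', '3PA']]

# alternative: if you already built the transposed frame, flip it back
df_t = pd.DataFrame(data=[data['Age'], data['FG'], data['FGA'],
                          data['3P'], data['3PA']])
df = df_t.T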
