Drop column using Dask dataframe - python

This should work:
raw_data.drop('some_great_column', axis=1).compute()
But the column is not dropped. In pandas I use:
raw_data.drop(['some_great_column'], axis=1, inplace=True)
But inplace does not exist in Dask. Any ideas?

You can separate this into two operations:
# dask operation
raw_data = raw_data.drop('some_great_column', axis=1)
# conversion to pandas
df = raw_data.compute()
Then export the Pandas dataframe to a CSV file:
df.to_csv(r'out.csv', index=False)
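If the result is too large to hold in memory as a single pandas dataframe, note that Dask can also write CSVs directly, one file per partition by default (recent Dask versions also accept single_file=True):
# skip the pandas conversion and let Dask write one CSV per partition;
# the '*' in the name is replaced by the partition number
raw_data.drop('some_great_column', axis=1).to_csv('out-*.csv', index=False)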

I assume you want to keep "raw data" in a Dask DF. In that case the following will do the trick:
new_raw_df = raw_data.drop('some_great_column', axis=1).copy()
where type(new_raw_df) is dask.dataframe.core.DataFrame and you can delete the original DF.
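A minimal runnable sketch of the above, with made-up column names:
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'some_great_column': [1, 2, 3], 'keep_me': [4, 5, 6]})
raw_data = dd.from_pandas(pdf, npartitions=2)

# drop is lazy and returns a new Dask DataFrame; raw_data itself is untouched
new_raw_df = raw_data.drop('some_great_column', axis=1)
print(new_raw_df.compute())  # only 'keep_me' remains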

Related

Converting columns into rows in pd.DataFrame

How can I convert columns to rows in a pd.DataFrame? Currently my code is as below; instead of having my values returned in columns, I want them displayed in rows. I have tried using iterrows:
df = pd.DataFrame(columns=cleaned_array)
output = df.to_csv(index=False, mode='a', encoding="utf-8")
print(output)
Try this:
df = pd.DataFrame(columns=cleaned_array)
df = df.T
This will interchange your rows and columns.
You want to use the transpose function.
df.T or df.transpose()
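For example (note that .T returns a new transposed frame rather than modifying df in place):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df = df.T  # columns 'a' and 'b' become the index; former rows become columns
print(df)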

Multiindex Dataframe, Pandas

I am trying to manipulate data from an Excel file; however, it has merged headings for the columns. I managed to transform them in pandas. Please see the example of the original data below.
So I transformed it to this format.
My final goal is to get the format below and plot brand items and their sales quantities and prices over the period; however, I don't know how to access the info in a MultiIndex dataframe. Could you please suggest something? Thanks.
My code:
import pandas as pd
df = pd.read_excel('path.xls', sheet_name = 'data', header = [0,1])
a = df.columns.get_level_values(0).to_series()
b = a.mask(a.str.startswith('Unnamed')).fillna('')
df.columns = [b, df.columns.get_level_values(1)]
df.drop(0, inplace=True)
Try pandas groupby or pivot_table. pivot_table takes index, columns, values and aggfunc arguments. It is really nice for summarizing data.
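A small sketch of pivot_table; the brand/period/sales column names here are hypothetical, since the original spreadsheet isn't shown:
import pandas as pd

# hypothetical flat data; substitute the column names from your Excel file
df = pd.DataFrame({'brand': ['A', 'A', 'B', 'B'],
                   'period': ['Q1', 'Q2', 'Q1', 'Q2'],
                   'sales': [10, 20, 30, 40]})

# one row per brand, one column per period, summed sales in the cells
table = pd.pivot_table(df, index='brand', columns='period',
                       values='sales', aggfunc='sum')
print(table)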

Exporting sorted/adjusted data to excel with python

I have a simple dataset that I have sorted in a dataframe based on 'Category'.
The sorting has gone all well. But now, I'd like to export the sorted/adjusted dataset in .xlsx format. That is, the dataset that has been categorized, not the dataset as it was read from Excel.
I have tried the following:
import pandas as pd
df = pd.read_excel("python_sorting_test.xlsx",index_col=[1])
df.head()
print(df.sort_index(level=['Category'], ascending=True))
df.to_excel (r'C:\Users\Laptop\PycharmProjects\untitled8\export_dataframe.xlsx', header=True)
The issue: It doesn't store the sorted/adjusted dataset.
Actually, you don't save the result of sort_index. You can add inplace=True:
df.sort_index(level=['Category'], ascending=True, inplace=True)
or save the result of df.sort_index:
df = df.sort_index(level=['Category'], ascending=True)
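Putting it together, the full read-sort-write flow from the question then looks like this:
import pandas as pd

df = pd.read_excel("python_sorting_test.xlsx", index_col=[1])
df = df.sort_index(level=['Category'], ascending=True)  # keep the sorted result
df.to_excel(r'C:\Users\Laptop\PycharmProjects\untitled8\export_dataframe.xlsx', header=True)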

Pandas Data Frame saving into csv file

I wonder how to save a new pandas Series into a CSV file in a different column. Suppose I have two CSV files which both contain a column 'A'. I have applied some mathematical function to them and then created a new variable 'B'.
For example:
data = pd.read_csv('filepath')
data['B'] = data['A']*10
# and add the value of data.B into a list as a B_list.append(data.B)
This will continue until all of the rows of the first and second CSV files have been read.
I would like to save a column B in a new spread sheet from both csv files.
For example I need this result:
column1 (from csv1)    column2 (from csv2)
data.B.value           data.B.value
By using this code:
pd.DataFrame(np.array(B_list)).T.to_csv('file.csv', index=False, header=None)
I won't get my preferred result.
Since each column in a pandas DataFrame is a pandas Series, your B_list is actually a list of pandas Series, which you can pass to the DataFrame() constructor and then transpose (or, as @jezrael shows, do a horizontal merge with pd.concat(..., axis=1)):
finaldf = pd.DataFrame(B_list).T
finaldf.to_csv('output.csv', index=False, header=None)
Should the CSV files have different numbers of rows, the unequal Series are filled with NaNs at the corresponding rows.
I think you need to concat the column from data1 with the column from data2 first:
df = pd.concat(B_list, axis=1)
df.to_csv('file.csv', index=False, header=None)
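A quick sketch showing both approaches give the same column-wise result, with NaN padding when the Series lengths differ (the Series here are stand-ins for the real B columns):
import pandas as pd

# stand-ins for the B columns computed from csv1 and csv2
B_list = [pd.Series([10, 20, 30]), pd.Series([40, 50])]

out1 = pd.DataFrame(B_list).T     # each Series becomes a column
out2 = pd.concat(B_list, axis=1)  # same result via horizontal concat
print(out1)  # second column ends with NaN because it is shorter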

Store multi-index pandas dataframe with hdf5 table format

I just came across this issue when adding a multi-index to my pandas dataframe. I am using the pandas HDFStore with the option format='table', which I prefer because the saved data frame is easier to understand and load when not using pandas. (For details see this SO answer: Save pandas DataFrame using h5py for interoperability with other hdf5 readers.)
But I ran into a problem because I was setting the multi-index using drop=False when calling set_index, which keeps the index columns as dataframe columns. This was fine until I put the dataframe to the store using format='table'. Using format='fixed' worked fine. But format='table' gave me an error with duplicate column names. I avoided the error by dropping the redundant columns before putting and restoring the columns after getting.
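A minimal sketch that reproduces the issue as I understand it (small hypothetical frame; the exact error message may vary by pandas version):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'v': [5, 6]})
# keep 'a' and 'b' as columns while also using them as the MultiIndex
df.set_index(['a', 'b'], drop=False, inplace=True)

with pd.HDFStore('example.h5') as store:
    store.put('fixed_ok', df, format='fixed')     # works
    store.put('table_fails', df, format='table')  # raises on duplicate names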
Here is the write/read pair of functions that I now use:
def write_df_without_index_columns(store, name, df):
    if isinstance(df.index, pd.MultiIndex):
        # drop any columns that are duplicates of index columns
        redundant_columns = set(df.index.names).intersection(set(df.columns))
        if redundant_columns:
            df = df.copy(deep=True)
            df.drop(list(redundant_columns), axis=1, inplace=True)
    store.put(name, df,
              format='table',
              data_columns=True)

def read_df_add_index_columns(store, name, default_value):
    df = store.get(name)
    if isinstance(df.index, pd.MultiIndex):
        # remember the MultiIndex column names
        index_columns = df.index.names
        # put the MultiIndex columns into the data frame
        df.reset_index(drop=False, inplace=True)
        # now put the MultiIndex columns back into the index
        df.set_index(index_columns, drop=False, inplace=True)
    return df
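For completeness, a round-trip sketch of how I call these helpers (the store path and frame are placeholders):
with pd.HDFStore('data.h5') as store:
    write_df_without_index_columns(store, 'my_frame', df)
    restored = read_df_add_index_columns(store, 'my_frame', default_value=None)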
My question: is there a better way to do this? I expect to have a data frame with millions of rows, so I do not want this to be too inefficient.
