Exporting sorted/adjusted data to Excel with Python

I have a simple dataset that I have sorted in a DataFrame based on 'Category'.
The sorting went well, but now I'd like to export the sorted/adjusted dataset in .xlsx format, that is, the dataset after it has been categorized, not the dataset as originally read from Excel.
I have tried the following:
import pandas as pd

df = pd.read_excel("python_sorting_test.xlsx", index_col=[1])
df.head()
print(df.sort_index(level=['Category'], ascending=True))
df.to_excel(r'C:\Users\Laptop\PycharmProjects\untitled8\export_dataframe.xlsx', header=True)
The issue: it doesn't store the sorted/adjusted dataset.

Actually, you don't save the result of sort_index. You can add inplace=True (note that with inplace=True the method returns None, so there is nothing to print):
df.sort_index(level=['Category'], ascending=True, inplace=True)
or assign the result of df.sort_index back to df:
df = df.sort_index(level=['Category'], ascending=True)
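Putting it together, a minimal sketch of the full read, sort, and export flow (using the same file paths as above):

import pandas as pd

# Read the sheet with the 'Category' column as the index
df = pd.read_excel("python_sorting_test.xlsx", index_col=[1])

# Assign the sorted result back to df, then export that
df = df.sort_index(level=['Category'], ascending=True)
df.to_excel(r'C:\Users\Laptop\PycharmProjects\untitled8\export_dataframe.xlsx', header=True)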

Related

Processing an Excel sheet in Python

I have an Excel sheet with a list of experiments, as shown in the picture below. How can I access specific rows and columns to find the mean and standard deviation? I am able to load the Excel file and read the data using pandas, but I am not sure where to go from there. Ideally, the code should be able to process sheets with many experiment results listed.
[Image: Excel input]
For output, I would like a table summarizing the results, as shown in the picture below:
[Image: Result Summary]
I am not sure if this is an efficient solution, but it will do.
import pandas as pd

# Read the sheet without headers and transpose it so experiments become rows
df = pd.read_excel("~/Desktop/delete.xlsx", header=None).T

# Drop empty columns, promote the first row to column names, then drop that row
df.dropna(axis=1, inplace=True)
df.rename(columns=df.iloc[0], inplace=True)
df.drop(df.index[0], inplace=True)

# Collapse the repeated column names and reshape to one row per experiment
# (3 experiments x 8 columns here; adjust the shape to your sheet)
cols = df.columns.unique()
df1 = pd.DataFrame(df.values.reshape(3, 8), columns=cols)

# Mean and std over the measurement columns only
value_cols = df1.columns[~df1.columns.isin(['Test', 'Sample', 'Site'])]
df1['mean'] = df1[value_cols].mean(axis=1)
df1['std'] = df1[value_cols].std(axis=1)
print(df1[['Sample', 'mean', 'std']])
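One caveat worth noting: after the transpose, the measurement columns may come back as object dtype, in which case mean/std can misbehave. A small sketch of a fix, assuming the same df1 and value_cols as above, to run before the aggregation:

# Coerce the measurement columns to numeric before aggregating;
# non-numeric cells become NaN and are skipped by mean/std
df1[value_cols] = df1[value_cols].apply(pd.to_numeric, errors='coerce')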

Resampling many timeseries files with pandas/dask

I have many .csv files with two columns, one with timestamps and the other with values. The data is sampled in seconds. What I would like to do is:
- read all files
- set the index on the time column
- resample to hours
- save to new files (parquet, hdf, ...)
1) Only dask
I tried to use dask's read_csv.
import dask.dataframe as dd
import pandas as pd
df = dd.read_csv(
    "../data_*.csv",
    parse_dates=[0],
    date_parser=lambda x: pd.to_datetime(float(x)),
)
So far that's fine. The problem is that I cannot call df.resample("min").mean() directly, because the index of the Dask DataFrame is not set properly.
After calling df.reset_index().set_index("timestamp") it works, BUT I cannot afford to do this because it is expensive.
2) Workaround with pandas and hdf files
Another approach was to save all csv files to hdf files using pandas. In this case the pandas dataframes were already indexed by time.
df = dd.read_hdf("/data_01.hdf", key="data")
# This doesn't work directly
# df = df.resample("min").mean()
# Error: "Can only resample dataframes with known divisions"
df = df.reset_index().set_index("timestamp") # expensive! :-(
df = df.resample("min").mean() # works!
Of course this works, but it would be extremely expensive on dd.read_hdf("/data_*.hdf", key="data").
How can I read timeseries data directly into dask so that it is properly partitioned and indexed?
Do you have any tips or suggestions?
Example data:
import dask
df = dask.datasets.timeseries()
df.to_hdf("dask.hdf", "data")
# Doesn't work!
# dd.read_hdf("dask.hdf", key="data").resample("min").mean()
# Works!
dd.read_hdf("dask.hdf", key="data").reset_index().set_index("timestamp").resample(
"min"
).mean()
Can you try something like:
pd.read_csv('data.csv', index_col='timestamp', parse_dates=['timestamp']) \
    .resample('T').mean().to_csv('data_1min.csv')  # or to_hdf(...)
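Extending that per-file idea to all of the csv files, a minimal sketch (assuming the same '../data_*.csv' pattern and a 'timestamp' column that parse_dates can handle; if your timestamps are floats, reuse the conversion from the question). Parquet keeps the sorted index, so dask can recover the partitioning afterwards:

import glob
import pandas as pd
import dask.dataframe as dd

# Resample each file with pandas, where setting the index is cheap,
# and write one parquet file per input csv
for path in glob.glob("../data_*.csv"):
    df = pd.read_csv(path, index_col="timestamp", parse_dates=["timestamp"])
    df.resample("1h").mean().to_parquet(path.replace(".csv", ".parquet"))

# Read everything back as a single dask dataframe with a sorted index;
# depending on the dask version, calculate_divisions=True may be needed
# to recover known divisions from the parquet metadata
ddf = dd.read_parquet("../data_*.parquet")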

MultiIndex DataFrame, Pandas

I am trying to manipulate data from an Excel file; however, it has merged headings for the columns, which I managed to transform in pandas. Please see the example of the original data below.
So I transformed it to this format.
My final goal is to get to the format below and plot brand items and their sales quantities and prices over the period; however, I don't know how to access info in a MultiIndex DataFrame. Could you please suggest something? Thanks.
My code:
import pandas as pd

# Read with a two-row (merged) header, which produces a MultiIndex on the columns
df = pd.read_excel('path.xls', sheet_name='data', header=[0, 1])

# Blank out the 'Unnamed: ...' placeholders in the top level of the header
a = df.columns.get_level_values(0).to_series()
b = a.mask(a.str.startswith('Unnamed')).fillna('')
df.columns = [b, df.columns.get_level_values(1)]
df.drop(0, inplace=True)
Try pandas groupby or pivot_table. The pivot_table takes index, columns, values, and aggfunc arguments. It is really nice for summarizing data.
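A minimal sketch of the pivot_table route, assuming hypothetical flattened column names 'Brand', 'Period', and 'Quantity' (your actual labels from the MultiIndex will differ):

import pandas as pd

# Hypothetical data standing in for the flattened Excel sheet
df = pd.DataFrame({
    'Brand': ['A', 'A', 'B', 'B'],
    'Period': ['2020-01', '2020-02', '2020-01', '2020-02'],
    'Quantity': [10, 12, 7, 9],
})

# One row per period, one column per brand, summed quantities in the cells
summary = df.pivot_table(index='Period', columns='Brand',
                         values='Quantity', aggfunc='sum')
summary.plot()  # each brand becomes one line over the period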

Drop column using Dask dataframe

This should work:
raw_data.drop('some_great_column', axis=1).compute()
But the column is not dropped. In pandas I use:
raw_data.drop(['some_great_column'], axis=1, inplace=True)
But inplace does not exist in Dask. Any ideas?
You can separate into two operations:
# dask operation
raw_data = raw_data.drop('some_great_column', axis=1)
# conversion to pandas
df = raw_data.compute()
Then export the Pandas dataframe to a CSV file:
df.to_csv(r'out.csv', index=False)
I assume you want to keep "raw data" in a Dask DF. In that case the following will do the trick:
new_raw_df = raw_data.drop('some_great_column', axis=1).copy()
where type(new_raw_df) is dask.dataframe.core.DataFrame and you can delete the original DF.

Saving a pandas DataFrame into a csv file

I wonder how to save a new pandas Series into a csv file as a separate column. Suppose I have two csv files which both contain a column 'A'. I apply some mathematical function to it and then create a new column 'B'.
For example:
import pandas as pd

data = pd.read_csv('filepath')
data['B'] = data['A'] * 10
# then append data.B to a list: B_list.append(data.B)
This continues until all of the rows of the first and second csv files have been read.
I would like to save column B from both csv files in a new spreadsheet. For example, I need this result:
column 1 (from csv1)    column 2 (from csv2)
data.B values           data.B values
By using this code:
pd.DataFrame(np.array(B_list)).T.to_csv('file.csv', index=False, header=None)
I don't get my preferred result.
Since each column in a pandas DataFrame is a pandas Series, your B_list is actually a list of pandas Series, which you can pass to the DataFrame() constructor and then transpose (or, as @jezrael shows, horizontally merge with pd.concat(..., axis=1)):
finaldf = pd.DataFrame(B_list).T
finaldf.to_csv('output.csv', index=False, header=None)
And should the csv files have different numbers of rows, the unequal Series are filled with NaNs at the corresponding rows.
I think you need to concat the column from data1 with the column from data2 first:
df = pd.concat(B_list, axis=1)
df.to_csv('file.csv', index=False, header=None)
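Putting the pieces together, a minimal end-to-end sketch (the file names csv1.csv and csv2.csv and the *10 function are hypothetical stand-ins):

import pandas as pd

B_list = []
for path in ['csv1.csv', 'csv2.csv']:  # hypothetical input files
    data = pd.read_csv(path)
    data['B'] = data['A'] * 10  # the mathematical function from above
    B_list.append(data['B'])

# One column per input file; if the files have different lengths,
# the shorter columns are padded with NaN
pd.concat(B_list, axis=1).to_csv('file.csv', index=False, header=None)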
