Resampling many timeseries files with pandas/dask - python

I have many .csv files with two columns: one with timestamps and the other with values. The data is sampled at one-second intervals. What I would like to do is:
read all files
set index on time column
resample on hours
save to new files (parquet, hdf,...)
1) Only dask
I tried to use dask's read_csv.
import dask.dataframe as dd
import pandas as pd
df = dd.read_csv(
    "../data_*.csv",
    parse_dates=[0],
    date_parser=lambda x: pd.to_datetime(float(x)),
)
So far that's fine. The problem is that I cannot call df.resample("min").mean() directly, because the index of the Dask DataFrame is not properly set.
After calling df.reset_index().set_index("timestamp") it works, BUT I cannot afford to do this because it is expensive.
2) Workaround with pandas and hdf files
Another approach was to save all csv files to hdf files using pandas. In this case the pandas dataframes were already indexed by time.
df = dd.read_hdf("/data_01.hdf", key="data")
# This doesn't work directly
# df = df.resample("min").mean()
# Error: "Can only resample dataframes with known divisions"
df = df.reset_index().set_index("timestamp") # expensive! :-(
df = df.resample("min").mean() # works!
Of course this works, but it would be extremely expensive for dd.read_hdf("/data_*.hdf", key="data").
How can I read timeseries data directly into dask so that it is properly partitioned and indexed?
Do you have any tips or suggestions?
Example data:
import dask
import dask.dataframe as dd
df = dask.datasets.timeseries()
df.to_hdf("dask.hdf", "data")
# Doesn't work!
# dd.read_hdf("dask.hdf", key="data").resample("min").mean()
# Works!
dd.read_hdf("dask.hdf", key="data").reset_index().set_index("timestamp").resample(
"min"
).mean()

Can you try something like:
pd.read_csv('data.csv', index_col='timestamp', parse_dates=['timestamp']) \
    .resample('T').mean().to_csv('data_1min.csv')  # or to_hdf(...)
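To cover many files, the same one-liner can be wrapped in a loop. A minimal sketch, assuming the files match the "../data_*.csv" pattern from the question, that the timestamp column is named "timestamp" and parses with parse_dates as above, and with a made-up output naming scheme:
import glob
import pandas as pd

for path in glob.glob("../data_*.csv"):
    out = path.replace(".csv", "_1min.csv")  # hypothetical output name
    (pd.read_csv(path, index_col="timestamp", parse_dates=["timestamp"])
       .resample("T").mean()
       .to_csv(out))
Each resampled file is then small enough to read back with dask or pandas as needed.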

Related

Convert large spark DF to pandas DF

I have a huge (1258355, 14) pyspark dataframe that has to be converted to a pandas df. There is probably a memory issue (modifying the config file did not work): pdf = df.toPandas() fails, while pdf1 = df.limit(1000) works. How can I iterate through the whole df, convert the slices to pandas dfs and join them at the end?
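One possible way to do the slicing described above (not taken from the original thread) is to stream rows to the driver and build the pandas frame piece by piece. A rough sketch, assuming an existing Spark DataFrame df and an arbitrarily chosen chunk size:
import pandas as pd

def spark_to_pandas_in_chunks(df, chunk_size=100_000):
    # toLocalIterator() streams rows to the driver one partition at a time,
    # so the whole dataset never has to fit in driver memory at once.
    columns = df.columns
    chunks, buffer = [], []
    for row in df.toLocalIterator():
        buffer.append(row.asDict())
        if len(buffer) >= chunk_size:
            chunks.append(pd.DataFrame(buffer, columns=columns))
            buffer = []
    if buffer:
        chunks.append(pd.DataFrame(buffer, columns=columns))
    return pd.concat(chunks, ignore_index=True)

pdf = spark_to_pandas_in_chunks(df)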

Efficiently load and store data using Dask by changing one column at a time

I'm in the process of implementing a csv parser using Dask and pandas dataframes. I'd like to make it load only the columns it needs, so that it works well with large files and doesn't need to load large amounts of data into memory.
Currently the only method I've found of writing a column to a parquet/Dask dataframe is by loading all the data as a pandas dataframe, modifying the column and converting from pandas.
all_data = self.data_set.compute() # Loads all data, compute to pandas dataframe
all_data[column] = column_data # Modifies one column
self.data_set = dd.from_pandas(all_data, npartitions=2) # Store all data into dask dataframe
This seems really inefficient, so I was looking for a way to avoid having to load all the data and perhaps modify one column at a time or write directly to parquet.
I've stripped away most of the code but here is an example function that is meant to normalise the data for just one column.
import pandas as pd
import dask.dataframe as dd
def normalise_column(self, column, normalise_type=NormaliseMethod.MEAN_STDDEV):
    column_data = self.data_set.compute()[column]  # This also converts all data to a pandas dataframe
    if normalise_type is NormaliseMethod.MIN_MAX:
        [min, max] = [column_data.min(), column_data.max()]
        column_data = column_data.apply(lambda x: (x - min) / (max - min))
    elif normalise_type is NormaliseMethod.MEAN_STDDEV:
        [mean, std_dev] = [column_data.mean(), column_data.std()]
        column_data = column_data.apply(lambda x: (x - mean) / std_dev)
    all_data = self.data_set.compute()
    all_data[column] = column_data
    self.data_set = dd.from_pandas(all_data, npartitions=2)
Can someone please help me make this more efficient for large amounts of data?
Due to the binary nature of the parquet format, and because compression is normally applied to the column chunks, it is never possible to update the values of a column in a file without a full load-process-save cycle (the number of bytes would not stay constant). At least, Dask should enable you to do this partition by partition, without breaking memory.
It would be possible to write custom code that avoids parsing the compressed binary data in the columns you know you don't want to change, simply reading and writing them again, but implementing this would take some work.
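To illustrate the partition-by-partition route described above, here is a minimal sketch. It assumes the data already lives in a parquet dataset at "data.parquet" and that the column to normalise is called "x"; both names are made up for the example.
import dask
import dask.dataframe as dd

ddf = dd.read_parquet("data.parquet")                      # lazy, nothing is loaded yet
mean, std = dask.compute(ddf["x"].mean(), ddf["x"].std())  # two cheap reductions

# assign() builds the new column lazily; each partition is normalised
# independently when the graph runs, so memory stays bounded.
ddf = ddf.assign(x=(ddf["x"] - mean) / std)

# Writing re-creates the files, but only one partition is materialised at a time.
ddf.to_parquet("data_normalised.parquet")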

Exporting sorted/adjusted data to excel with python

I have a simple dataset that I have sorted in a dataframe based on 'category'.
The sorting went well, but now I'd like to export the sorted/adjusted dataset to .xlsx format, that is, the dataset that has been categorized, not the original dataset read from Excel.
I have tried the following:
import pandas as pd
df = pd.read_excel("python_sorting_test.xlsx",index_col=[1])
df.head()
print(df.sort_index(level=['Category'], ascending=True))
df.to_excel (r'C:\Users\Laptop\PycharmProjects\untitled8\export_dataframe.xlsx', header=True)
The issue: it doesn't store the sorted/adjusted dataset.
Actually, you don't save the result of sort_index. You can either add inplace=True:
df.sort_index(level=['Category'], ascending=True, inplace=True)
print(df)
or save the result of df.sort_index:
df = df.sort_index(level=['Category'], ascending=True)
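Putting the two fixes together, a minimal end-to-end sketch using the same file names as in the question:
import pandas as pd

df = pd.read_excel("python_sorting_test.xlsx", index_col=[1])
df = df.sort_index(level=["Category"], ascending=True)  # keep the sorted result
df.to_excel(r"C:\Users\Laptop\PycharmProjects\untitled8\export_dataframe.xlsx", header=True)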

Merging two excel files using python with mismatching sizes

I have been trying to merge these two Excel files.
The files are already prepared to be joined, just as you can see in my image example.
I have tried the solutions from the answer here using pandas and xlwt, but I still cannot save both in one file.
The desired result is:
P.S.: the two data frames may have mismatched columns and rows, which should just be ignored. I am looking for a way to paste one into the other using pandas.
How can I approach this problem? Thank you in advance.
import pandas as pd
import numpy as np
df = pd.read_excel('main.xlsx')
df.index = np.arange(1, len(df) + 1)
df1 = pd.read_excel('alt.xlsx', header=None, names=list(df))
for i in list(df):
    if any(pd.isnull(df[i])):
        df[i] = df1[i]
print(df)
df.to_excel("<filename>.xlsx", index=False)
Try this. main.xlsx is your first Excel file, while alt.xlsx is the second one.

Merging two data frames Pandas

Still not getting the hang of pandas, I am attempting to join two data frames in Pandas using merge. I have read in the CSVs into two data frames (named dropData and deosData in the code below). Both data frames have the column ‘Date_Time’, which is a parsed column of Date and Time information to create a unique id for each entry. The deosData file is an entire year’s worth of observations that I am trying to match up with corresponding entries in dropData.
CSV files:
deosData: https://www.dropbox.com/s/3rr7hf7jzrmxdke/inputDeos.csv?dl=0
dropData: https://www.dropbox.com/s/z9mv4xccjzlsyif/inputDrop.csv?dl=0
I have gone through the documentation for the merge function and have tried the following code in various iterations; so far I have only been able to produce either a blank data frame with the correct header row, or the two data frames merged on the default 0 to N-1 integer index:
My code:
import pandas as pd
import numpy as np
import os
from matplotlib import pyplot as plt
#read in CSV to dataframe
dropData=pd.read_csv("inputDrop.csv", header=0, index_col=None)
deosData=pd.read_csv("inputDeos.csv", header=0, index_col=None)
#merging dataframes into single sf
merge=pd.merge(dropData,deosData, how='inner', on='Date_Time')
#comment out during debugging
#merge.to_csv('output.csv', sep=',', headers=True, index=False)
#check merge dataframe creation
print merge.head(1)
After searching on SE and the docs, I have tried resetting the index, ignoring the index columns, copying the 'Date_Time' column as a separate index and trying to merge on the new column; I have also tried using 'on=None', 'left_on' and 'right_on' as permutations of 'Date_Time', to no avail. I have checked the column data types: 'Date_Time' is dtype object in both, and I do not know if this is the source of the error, since the only issues I could find while searching revolved around matching different dtypes to each other.
What I am looking to do is have the two data frames merge where the two 'Date_Time' columns intersect. For example:
Date_Time,Volume(Max),Volume(Sum),Volume(Min),Volume(Mean),Diameter(Count),Diameter(Max),Diameter(Sum),Diameter(Min),Diameter(Mean),Depth(Sum),Velocity(Max),Velocity(Sum),Velocity(Min),Velocity(Mean), Air Temperature (deg. C), Relative humidity (%), Wind Speed (m.s-1), Wind Direction (deg.), Wind Gust Speed (5) (m.s-1), Barometric Pressure (mbar), Gage Precipitation (5) (mm)
9/1/2014 0:00,2.266188524,2.989272461,0.052464219,0.332141385,9,1.629668,5.972978,0.464467,0.663664222,0.003736591,2.288401,16.889656,1.495487,1.876628444,22.5,99,0,216.1,0.4,1016.2,0
Any help would be greatly appreciated.
You need to pass parse_dates when reading the csv files, so that the Date_Time columns in both dataframes are pd.Timestamp objects instead of raw strings (if you look at your csv files, one is in ISO format YYYY-MM-DD HH:MM:SS whereas the other is in MM/DD/YYYY HH:MM). Try the following code:
#read in CSV to dataframe
dropData = pd.read_csv("inputDrop.csv", header=0, index_col=None, parse_dates=['Date_Time'])
deosData = pd.read_csv("inputDeos.csv", header=0, index_col=None, parse_dates=['Date_Time'])
and then do your merge.
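For completeness, a minimal sketch of that merge step, reusing the merge call from the question:
merged = pd.merge(dropData, deosData, how="inner", on="Date_Time")
merged.to_csv("output.csv", sep=",", index=False)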
You can use join, but you first need to set the index:
dropData=pd.read_csv('.../inputDrop.csv', header=0, index_col='Date_Time', parse_dates=True)
deosData=pd.read_csv('.../inputDeos.csv', header=0, index_col='Date_Time', parse_dates=True)
dropData.join(deosData)
