I have a huge (1258355, 14) PySpark dataframe that has to be converted to a pandas dataframe. There is probably a memory issue (modifying the config file did not work):
pdf = df.toPandas() fails, while pdf1 = df.limit(1000) works. How can I iterate through the whole dataframe, convert the slices to pandas dataframes and join them at the end?
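A minimal sketch of one possible approach (not from the original post; the chunk size, the temporary _row column and the ordering key are assumptions): tag each row with a generated row number, pull the dataframe over in fixed-size slices with toPandas(), and concatenate the slices at the end.
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.window import Window

chunk_size = 100_000  # assumed slice size; tune to the available driver memory

# row_number() needs an ordering, so use a monotonically increasing id as the key
# (the unpartitioned window funnels the ordering through a single partition)
df = df.withColumn("_row", F.row_number().over(
    Window.orderBy(F.monotonically_increasing_id())))

total = df.count()
chunks = []
for start in range(1, total + 1, chunk_size):
    chunk = (df.filter((F.col("_row") >= start) &
                       (F.col("_row") < start + chunk_size))
               .drop("_row")
               .toPandas())  # only one slice is converted at a time
    chunks.append(chunk)

pdf = pd.concat(chunks, ignore_index=True)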
I'm in the process of implementing a CSV parser using Dask and pandas dataframes. I'd like to make it load only the columns it needs, so that it works well with large files and doesn't need to load large amounts of data into memory.
Currently the only method I've found of writing a column to a parquet/Dask dataframe is by loading all the data as a pandas dataframe, modifying the column and converting from pandas.
all_data = self.data_set.compute()  # Loads all data, computing to a pandas dataframe
all_data[column] = column_data  # Modifies one column
self.data_set = dd.from_pandas(all_data, npartitions=2)  # Stores all data back into a dask dataframe
This seems really inefficient, so I was looking for a way to avoid having to load all the data and perhaps modify one column at a time or write directly to parquet.
I've stripped away most of the code but here is an example function that is meant to normalise the data for just one column.
import pandas as pd
import dask.dataframe as dd

def normalise_column(self, column, normalise_type=NormaliseMethod.MEAN_STDDEV):
    column_data = self.data_set.compute()[column]  # This also converts all data to a pandas dataframe

    if normalise_type is NormaliseMethod.MIN_MAX:
        col_min, col_max = column_data.min(), column_data.max()
        column_data = column_data.apply(lambda x: (x - col_min) / (col_max - col_min))

    elif normalise_type is NormaliseMethod.MEAN_STDDEV:
        col_mean, col_std = column_data.mean(), column_data.std()
        column_data = column_data.apply(lambda x: (x - col_mean) / col_std)

    all_data = self.data_set.compute()
    all_data[column] = column_data
    self.data_set = dd.from_pandas(all_data, npartitions=2)
Can someone please help me make this more efficient for large amounts of data?
Due to the binary nature of the parquet format, and because compression is normally applied to the column chunks, it is never possible to update the values of a column in a file without a full load-process-save cycle (the number of bytes would not stay constant). At least Dask should enable you to do this partition-by-partition, without exhausting memory.
It would be possible to write custom code that avoids parsing the compressed binary data in the columns you know you don't want to change, just reading and writing them back again, but implementing this would take some work.
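As a rough illustration of that partition-by-partition cycle, here is a minimal sketch (the file names and the "value" column are assumptions, not taken from the question): compute the global statistics in one pass, assign the normalised column lazily, and let to_parquet process and save one partition at a time.
import dask
import dask.dataframe as dd

ddf = dd.read_parquet("data.parquet")  # lazy: nothing is loaded yet

# global statistics, evaluated in a single pass over the partitions
col_mean, col_std = dask.compute(ddf["value"].mean(), ddf["value"].std())

# the assignment is also lazy; each partition is normalised independently
ddf["value"] = (ddf["value"] - col_mean) / col_std

# the full load-process-save cycle, executed one partition at a time
ddf.to_parquet("data_normalised.parquet")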
I have a simple dataset that I have sorted in a dataframe based on 'Category'.
The sorting went well. Now I'd like to export the sorted/adjusted dataset in .xlsx format, i.e. the dataset that has been categorized, not the dataset as it was originally read in from Excel.
I have tried the following:
import pandas as pd
df = pd.read_excel("python_sorting_test.xlsx",index_col=[1])
df.head()
print(df.sort_index(level=['Category'], ascending=True))
df.to_excel(r'C:\Users\Laptop\PycharmProjects\untitled8\export_dataframe.xlsx', header=True)
The issue: It doesn't store the sorted/adjusted dataset.
Actually, you don't save the result of sort_index. You can add inplace=True:
df.sort_index(level=['Category'], ascending=True, inplace=True)
or save the result of df.sort_index:
df = df.sort_index(level=['Category'], ascending=True)
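Putting it together, a minimal end-to-end sketch (file names and the output path taken from the question): the key point is that to_excel must be called on the frame that holds the sorted result.
import pandas as pd

df = pd.read_excel("python_sorting_test.xlsx", index_col=[1])
df = df.sort_index(level=['Category'], ascending=True)  # keep the sorted result
df.to_excel(r'C:\Users\Laptop\PycharmProjects\untitled8\export_dataframe.xlsx', header=True)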
I have been trying to merge these two Excel files.
The files are already prepared to be joined, just as you can see in my image example.
I have tried the solutions from the answer here using pandas and xlwt, but I still cannot save both in one file.
Desired result is:
P.S.: the two data frames may have mismatched columns and rows, which should just be ignored. I am looking for a way to paste one into the other using pandas.
How can I approach this problem? Thank you in advance.
import pandas as pd
import numpy as np

df = pd.read_excel('main.xlsx')
df.index = np.arange(1, len(df) + 1)  # shift the index to start at 1
df1 = pd.read_excel('alt.xlsx', header=None, names=list(df))  # reuse the main file's column names

# any column with missing values in the main file is replaced
# wholesale with the corresponding column from the second file
for i in list(df):
    if any(pd.isnull(df[i])):
        df[i] = df1[i]

print(df)
df.to_excel("<filename>.xlsx", index=False)
Try this. main.xlsx is your first Excel file, while alt.xlsx is the second one.
Still not getting the hang of pandas, I am attempting to join two data frames in pandas using merge. I have read the CSVs into two data frames (named dropData and deosData in the code below). Both data frames have the column ‘Date_Time’, which is parsed from the date and time information to create a unique id for each entry. The deosData file is an entire year’s worth of observations that I am trying to match up with corresponding entries in dropData.
CSV files:
deosData: https://www.dropbox.com/s/3rr7hf7jzrmxdke/inputDeos.csv?dl=0
dropData: https://www.dropbox.com/s/z9mv4xccjzlsyif/inputDrop.csv?dl=0
I have gone through the documentation for the merge function and have tried the following code in various iterations; so far I have only been able to get a blank data frame with the correct header row, or the two data frames merged on the 0-(N-1) indexing that is assigned by default:
My code:
import pandas as pd
import numpy as np
import os
from matplotlib import pyplot as plt
#read in CSV to dataframe
dropData=pd.read_csv("inputDrop.csv", header=0, index_col=None)
deosData=pd.read_csv("inputDeos.csv", header=0, index_col=None)
#merging dataframes into single sf
merge=pd.merge(dropData,deosData, how='inner', on='Date_Time')
#comment out during debugging
#merge.to_csv('output.csv', sep=',', header=True, index=False)
#check merge dataframe creation
print(merge.head(1))
After searching on SE and the docs, I have tried resetting the index, ignoring the index columns, copying the ‘Date_Time’ column as a separate index and trying to merge on the new column; I have tried using ‘on=None’, ‘left_on’ and ‘right_on’ as permutations of ‘Date_Time’, to no avail. I have checked the column data types: ‘Date_Time’ in both is dtype object. I do not know if this is the source of the error, since the only issues I could find while searching revolved around matching different dtypes to each other.
What I am looking to do is have the two data frames merge where the two 'Date_Time' columns intersect. For example:
Date_Time,Volume(Max),Volume(Sum),Volume(Min),Volume(Mean),Diameter(Count),Diameter(Max),Diameter(Sum),Diameter(Min),Diameter(Mean),Depth(Sum),Velocity(Max),Velocity(Sum),Velocity(Min),Velocity(Mean), Air Temperature (deg. C), Relative humidity (%), Wind Speed (m.s-1), Wind Direction (deg.), Wind Gust Speed (5) (m.s-1), Barometric Pressure (mbar), Gage Precipitation (5) (mm)
9/1/2014 0:00,2.266188524,2.989272461,0.052464219,0.332141385,9,1.629668,5.972978,0.464467,0.663664222,0.003736591,2.288401,16.889656,1.495487,1.876628444,22.5,99,0,216.1,0.4,1016.2,0
Any help would be greatly appreciated.
You need to use parse_dates when reading the CSV files, so that the Date_Time columns in both dataframes hold pd.Timestamp objects instead of raw strings. (If you look at your CSV files, one is in ISO format YYYY-MM-DD HH:MM:SS whereas the other is in MM/DD/YYYY HH:MM.) Try the following code:
#read in CSV to dataframe
dropData = pd.read_csv("inputDrop.csv", header=0, index_col=None, parse_dates=['Date_Time'])
deosData = pd.read_csv("inputDeos.csv", header=0, index_col=None, parse_dates=['Date_Time'])
and then do your merge.
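For completeness, the inner merge from the question then works unchanged on the parsed timestamps:
merged = pd.merge(dropData, deosData, how='inner', on='Date_Time')
print(merged.head(1))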
You can use join, but you first need to set the index:
dropData=pd.read_csv('.../inputDrop.csv', header=0, index_col='Date_Time', parse_dates=True)
deosData=pd.read_csv('.../inputDeos.csv', header=0, index_col='Date_Time', parse_dates=True)
dropData.join(deosData)