I have a 10 million row, 60 column dataframe that I read in from a parquet file.
The line of code below, plus 3 further lines of manipulation, pivots my dataframe exactly how I need it. However, this code only works on smaller datasets, not on the larger one:
pivoted_df = pd.pivot_table(df.fillna('missing'), index=cols, columns='Field', values='Value', aggfunc='first').reset_index().replace('missing', np.nan)
pivoted_df = pivoted_df.drop(['FieldId', 'FieldType'], axis=1)
pivoted_df = pivoted_df.replace('nan', np.nan)
pivoted_df = pivoted_df.groupby('Id', as_index=False).last()
Is there any way I can chunk the data from df, pivot the chunks individually, then clean and join the pivoted data back together later? The kernel keeps crashing in both Spyder and the terminal. I'm open to using any other tools to do this.
I broke up the dataframe into evenly sized pieces using:
import numpy as np
z = np.array_split(df, 5)
then iterated over the list:
for i in z:
    # (rest of code)
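That loop can be fleshed out into a full chunk-pivot-recombine sketch. This is a toy illustration, not the original data: the Id/Field/Value column names come from the question, and the final groupby('Id').last() stands in for the cleaning steps:

```python
import numpy as np
import pandas as pd

# Toy long-format frame using the same column names as the question
df = pd.DataFrame({
    'Id':    [1, 1, 2, 2],
    'Field': ['a', 'b', 'a', 'b'],
    'Value': [10, 20, 30, 40],
})

pieces = []
for rows in np.array_split(np.arange(len(df)), 2):
    # Pivot each chunk of row positions on its own
    piece = pd.pivot_table(df.iloc[rows], index='Id', columns='Field',
                           values='Value', aggfunc='first')
    pieces.append(piece)

# Stack the pivoted chunks, then keep the last non-null value per Id,
# which also merges any Id that was split across two chunks
pivoted_df = pd.concat(pieces).groupby('Id').last().reset_index()
```

Splitting an index array rather than the dataframe itself avoids copying each chunk up front; only the `df.iloc[rows]` slice is materialised per iteration.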
I have many .csv files with two columns. One with timestamps and the other with values. The data is sampled on seconds. What I would like to do is
read all files
set index on time column
resample on hours
save to new files (parquet, hdf,...)
1) Dask only
I tried to use dask's read_csv.
import dask.dataframe as dd
import pandas as pd
df = dd.read_csv(
"../data_*.csv",
parse_dates = [0],
date_parser = lambda x: pd.to_datetime(float(x)),
)
So far that's fine. The problem is that I cannot df.resample("min").mean() directly, because the index of dask data frame is not properly set.
After calling df.reset_index().set_index("timestamp") it works - BUT I cannot afford to do this because it is expensive.
2) Workaround with pandas and hdf files
Another approach was to save all csv files to hdf files using pandas. In this case the pandas dataframes were already indexed by time.
df = dd.read_hdf("/data_01.hdf", key="data")
# This doesn't work directly
# df = df.resample("min").mean()
# Error: "Can only resample dataframes with known divisions"
df = df.reset_index().set_index("timestamp") # expensive! :-(
df = df.resample("min").mean() # works!
Of course this works but it would be extremely expensive on dd.read_hdf("/data_*.hdf", key="data").
How can I directly read timeseries data in dask that it is properly partitioned and indexed?
Do you have any tips or suggestions?
Example data:
import dask
df = dask.datasets.timeseries()
df.to_hdf("dask.hdf", "data")
# Doesn't work!
# dd.read_hdf("dask.hdf", key="data").resample("min").mean()
# Works!
dd.read_hdf("dask.hdf", key="data").reset_index().set_index("timestamp").resample(
"min"
).mean()
Can you try something like:
pd.read_csv('data.csv', index_col='timestamp', parse_dates=['timestamp']) \
    .resample('min').mean().to_csv('data_1min.csv')  # or to_hdf(...)
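Expanded into a per-file loop, that suggestion looks roughly like the sketch below. The file names and contents are made up so the example is self-contained, and it writes CSV only; swap the last line for to_parquet or to_hdf once pyarrow or PyTables is available:

```python
import glob
import pandas as pd

# Tiny stand-in for one of the second-sampled CSV files from the question
pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=7200, freq='s'),
    'value': range(7200),
}).to_csv('data_01.csv', index=False)

for path in glob.glob('data_*.csv'):
    # Each file fits in memory on its own, so plain pandas is enough
    df = pd.read_csv(path, index_col='timestamp', parse_dates=['timestamp'])
    hourly = df.resample('h').mean()  # resample on hours, as in the question
    # swap to_csv for hourly.to_parquet(...) / to_hdf(...) as needed
    hourly.to_csv(path.replace('.csv', '_hourly.csv'))
```

Because each output file is already indexed and sorted by time, a later dd.read_parquet over the results would also give dask known divisions without a shuffle.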
I have run across an unexplainable problem with Pandas regarding inserting a numpy array into the cell of a Dataframe. I am aware of the standard fixes for this error, but this case seems to be a little stranger.
I load a .csv file into a dataframe, collapse the last 1000 rows into a numpy array and place them back into a cell in a new dataframe.
The header format of the original file is essentially:
["TIMESTAMP", "Header0", ... , "Header10", "Waveform (1)", ... , "Waveform (1001)"]
# Load in dataframe from csv file, skipping some extra headers
input_df = pd.read_csv(filepath, header=1, skiprows=[2, 3], index_col='TIMESTAMP', parse_dates=True)
# Header columns that are not part of the waveform
header_columns = ["Header0", "Header1", "Header2", "Header3",
"Header4", "Header5", "Header6", "Header7",
"Header8", "Header9", "Header10"]
# Create a version of the input df that is only the Waveform columns, dropping the last one
waveform_df = input_df.drop(columns=header_columns).iloc[:, :-1]
# Create a version of the input df that is only the Header columns
header_df = input_df[header_columns]
# Create an output df that is a copy of the header df with a column added for storing the compressed waveforms
output_df = header_df.copy()
output_df['Waveform'] = None
# for every row in the waveform dataframe (which will be the same len as output_df)
for index, row in waveform_df.iterrows():
# create a numpy array from all the waveform columns of that row
waveform_array = waveform_df.loc[index].to_numpy()
# save the numpy array in the 'Waveform' column of the current row by index
output_df.at[index, 'Waveform'] = waveform_array
# return the compressed df
return output_df
This code actually works perfectly fine most of the time. The file I originally tested it on worked fine, and had 30 or so rows in it. However, when I attempted to run the same code on a different file with the same header formats which was about 1200 rows, it gave the error:
ValueError: Must have equal len keys and value when setting with an iterable
for the line:
output_df.at[index, 'Waveform'] = waveform_array
I compared the two incessantly and discovered that the only real difference between the two is the length, and in fact, if I crop the longer dataframe to a similar length to the shorter one, the error disappears. (Specifically, it works so long as I crop the dataframe to less than 250 rows)
I'd really like to be able to do this without cropping the dataframe and rebuilding it, so I was wondering if anyone had an insight into why this error was occurring.
Thanks!
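One workaround worth noting (not an explanation of the error): assign the whole column in a single statement instead of looping with .at, so no per-row indexing is involved at all. A minimal sketch with toy stand-ins for the header and waveform frames:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the header and waveform frames from the question
header_df = pd.DataFrame({'Header0': [1, 2, 3]},
                         index=pd.Index(['t0', 't1', 't2'], name='TIMESTAMP'))
waveform_df = pd.DataFrame(np.arange(12).reshape(3, 4), index=header_df.index)

output_df = header_df.copy()
# list(2-D array) yields one 1-D array per row, so each cell gets an ndarray
output_df['Waveform'] = list(waveform_df.to_numpy())
```

Since the rows of waveform_df and output_df are already aligned, this sidesteps row-by-row lookups by index entirely.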
So I am working with a fairly substantial CSV dataset that is a couple hundred megabytes. I have managed to read in the data in chunks (~100 rows each).
How do I then elegantly convert those chunks into a dataframe and apply the describe function to it?
Thank you
It seems you need to concat the TextFileReader object that read_csv returns when the chunksize parameter is given, and then call describe:
df = pd.concat(pd.read_csv('filename', chunksize=1000), ignore_index=True)
df = df.describe()
print(df)
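A self-contained illustration of the same pattern (the file name, contents and chunk size are made up for the demo):

```python
import pandas as pd

# Fabricate a small CSV so the example runs end to end
pd.DataFrame({'a': range(10), 'b': range(10, 20)}).to_csv('demo.csv', index=False)

# read_csv with chunksize yields an iterator of DataFrames;
# pd.concat accepts that iterator directly, no list comprehension needed
df = pd.concat(pd.read_csv('demo.csv', chunksize=3), ignore_index=True)
stats = df.describe()
```

Note that this still materialises the full dataframe before describe runs; chunked reading only bounds the memory used while parsing, not the final footprint.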
I'm reading in an excel .csv file using pandas.read_csv(). I want to read in 2 separate column ranges of the excel spreadsheet, e.g. columns A:D AND H:J, to appear in the final DataFrame. I know I can do it once the file has been loaded in using indexing, but can I specify 2 ranges of columns to load in?
I've tried something like this....
usecols=[0:3,7:9]
I know I could list each column number individually e.g.
usecols=[0,1,2,3,7,8,9]
but I have simplified the file in question; in my real file I have a large number of columns, so I need to be able to select 2 large ranges to read in...
I'm not sure if there's an official-pretty-pandaic-way to do it with pandas.
But you can do it this way:
# say you want to extract 2 ranges of columns
# columns 5 to 14
# and columns 30 to 66
import pandas as pd
range1 = list(range(5, 15))
range2 = list(range(30, 67))
usecols = range1 + range2
file_name = 'path/to/csv/file.csv'
df = pd.read_csv(file_name, usecols=usecols)
As #jezrael notes, you can use numpy.r_ to do this in a more pythonic and legible way:
import pandas as pd
import numpy as np
file_name = 'path/to/csv/file.csv'
df = pd.read_csv(file_name, usecols=np.r_[0:3, 7:9])
Gotchas
Watch out, when using this in combination with names, that you have allowed for the extra column pandas adds for the index, i.e. for csv columns 1,2,3 (3 items) np.r_ needs to be 0:3 (4 items).
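For reference, np.r_ with slice arguments simply concatenates the ranges, with the upper bounds excluded:

```python
import numpy as np

# 0:3 -> 0, 1, 2 and 7:9 -> 7, 8 (half-open, like Python slices)
cols = np.r_[0:3, 7:9]
print(cols)  # [0 1 2 7 8]
```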
I have a pandas dataframe called trg_data to collect data that I am producing in batches. Each batch is produced by a sub-routine as a smaller dataframe df with the same number of columns but less rows and I want to insert the values from df into trg_data at a new row position each time.
However, when I use the following statement, df is always inserted at the top (i.e. rows 0 to len(df)).
trg_data.iloc[trg_pt:(trg_pt + len(df))] = df
I'm guessing but I think the reason may be that even though the slice indicates the desired rows, it is using the index in df to decide where to put the data.
As a test I found that I can insert an ndarray at the right position no problem:
trg_data.iloc[trg_pt:(trg_pt + len(df))] = np.ones(df.shape)
How do I get it to ignore the index in df and insert the data where I want it? Or is there an entirely different way of achieving this? At the end of the day I just want to build up trg_data and save it to file once it's complete. I went down this route because there didn't seem to be a way of easily appending to an existing dataframe.
I've been working at this for over an hour and I can't figure out what to google to find the right answer!
I think I may have the answer (I thought I had already tried this but apparently not):
trg_data.iloc[trg_pt:(trg_pt + len(df))] = df.values
Still, I'm open to other suggestions. There's probably a better way to add data to a dataframe.
The way I would do this is save all the intermediate dataframes in an array, and then concatenate them together
import pandas as pd
dfs = []
# get all the intermediate dataframes somehow
# combine into one dataframe
trg_data = pd.concat(dfs)
Both
trg_data = pd.concat([df1, df2, ... dfn], ignore_index=True)
and
trg_data = pd.DataFrame()
for ...:  # loop that generates df
    trg_data = trg_data.append(df, ignore_index=True)  # you can reuse the name df
should work for you. (Note that DataFrame.append was removed in pandas 2.0, so on recent versions only the pd.concat form will run.)
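A runnable sketch of the concat variant, with toy batches standing in for the generated dfs:

```python
import pandas as pd

dfs = []
for i in range(3):
    # stand-in for the sub-routine that produces each batch
    df = pd.DataFrame({'x': [i, i + 10]})
    dfs.append(df)

# one concatenation at the end; ignore_index renumbers the rows 0..n-1
trg_data = pd.concat(dfs, ignore_index=True)
```

Collecting the batches in a list and concatenating once is also much cheaper than appending inside the loop, which copies the accumulated frame on every iteration.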