I'm trying to read big CSV files while also effectively working on other things at the same time. My solution is to add a progress bar (something that shows how far through the read I am and gives me a sense of how much time is left before the read completes). I have tried tqdm as well as hand-made while loops, but unfortunately I have not found a working solution. I have also tried this thread: How to see the progress bar of read_csv
without luck. Maybe I can apply tqdm in a different way? Are there any other solutions?
Here's the important part of the code (the part I want to add a progress bar to):
def read_from_csv(filepath: str,
                  sep: str = ",",
                  header_line: int = 43,
                  skip_rows: int = 48) -> pd.DataFrame:
    """Reads a csv file at filepath containing the vehicle trip data and
    performs a number of formatting operations
    """
    # The first call of read_csv is used to get the column names, which allows
    # the typing to take place at the same time as the second read, which is
    # faster than forcing types afterwards
    df_names: pd.Index[str] = pd.read_csv(
        filepath,
        sep=sep,
        header=header_line,
        skip_blank_lines=False,
        skipinitialspace=True,
        index_col=False,
        engine='c',
        nrows=0,
        encoding='iso-8859-1'
    ).columns
    # The "Time" and "Time_abs" columns have some inconsistent
    # "Storage group code" preceding the actual column name, so their
    # full column names are stored so they can be renamed later. Also, we want
    # to interpret "Time_abs" as a string, while the rest are floats. This is
    # stored in a dict to use in the next call to read_csv
    time_col = ""
    time_abs_col = ""
    names_dict = {}
    for name in df_names:
        if ": Time_abs" in name:
            names_dict[name] = 'str'
            time_abs_col = name
        elif ": Time" in name:
            time_col = name
        else:
            names_dict[name] = 'float'
    # A list of values that we want pandas to interpret as having no value.
    # "NOVALUE" is the only one of these that's actually used in the files,
    # the rest are copy-pasted defaults.
    na_vals = ['', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
               '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a',
               'nan', 'null', 'NOVALUE']
    # The whole file is parsed and put in a dataframe
    df: pd.DataFrame = pd.read_csv(filepath,
                                   sep=sep,
                                   skiprows=skip_rows,
                                   header=0,
                                   names=df_names,
                                   skip_blank_lines=False,
                                   skipinitialspace=True,
                                   index_col=False,
                                   engine='c',
                                   na_values=na_vals,
                                   dtype=names_dict,
                                   encoding='iso-8859-1'
                                   )
    # Renames the "Time" and "Time_abs" columns so they don't include the
    # storage group part
    df.rename(columns={time_col: "Time", time_abs_col: "Time_abs"},
              inplace=True)
    # Second retyping of this column (here from string to datetime).
    # Very rarely, the Time_abs column in the csv data only has the time and
    # not the date, in which case this line throws an error. We manage this by
    # simply letting it stay as a string
    try:
        df[defs.time_abs] = pd.to_datetime(df[defs.time_abs])
    except (ValueError, TypeError):
        pass
    # Every row ends with an extra delimiter which python interprets as another
    # column, but it's empty so we remove it. This is not really necessary, but
    # is done to reduce confusion when debugging
    df.drop(df.columns[-1], axis=1, inplace=True)
    # Adding extra columns to the dataframe used later
    df[defs.lowest_gear] = np.nan
    df[defs.lowest_speed] = np.nan
    for i in list(defs.second_trailer_axles_dict.values()):
        df[i] = np.nan
    return df
It's the CSV reading that takes most of the time, which is why that's where I want to add the progress bar.
Thank you in advance!
You can easily do this with Dask. For example:
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
ddf = dd.read_csv(path, blocksize=1e+6)
with ProgressBar():
    df = ddf.compute()
[########################################] | 100% Completed | 37.0s
And you will see the progress of reading the file.
The blocksize parameter controls the size of the blocks your file is read in; by tuning it you can get good performance. In addition, Dask reads with several threads by default, which speeds up the reading itself.
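If you want to keep the rest of read_from_csv unchanged, a minimal sketch would be to swap only the heavy second pd.read_csv for dd.read_csv, which accepts most of the same keyword arguments (index_col and skip_blank_lines are omitted below because Dask handles them differently; the variable names are the ones already built inside your function):
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

# Sketch: replaces only the second read inside read_from_csv.
ddf = dd.read_csv(filepath,
                  sep=sep,
                  skiprows=skip_rows,
                  header=0,
                  names=df_names,
                  skipinitialspace=True,
                  na_values=na_vals,
                  dtype=names_dict,
                  encoding='iso-8859-1',
                  blocksize=64e6)   # ~64 MB per partition
with ProgressBar():
    df = ddf.compute()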
You can use tqdm.
Somewhere in your function:
from tqdm import tqdm

def read_from_csv(filepath: str,
                  sep: str = ",",
                  header_line: int = 43,
                  skip_rows: int = 48,
                  chunksize: int = 10000) -> pd.DataFrame:
    # Count the total lines of the file.
    # Overhead: 3.73 s for 10,000,000 lines / 4.2 GB on an SSD
    length = sum(1 for row in open(filepath, 'r'))
    data = []
    with tqdm(total=1 + (length // chunksize)) as pbar:
        # Replace your 2nd pd.read_csv with this:
        for chunk in pd.read_csv(filepath, ..., chunksize=chunksize):
            data.append(chunk)
            pbar.update(1)  # one tick per chunk, since total is the chunk count
    df = pd.concat(data)
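Another option with tqdm, if you want to avoid the extra line-counting pass, is to track progress by bytes read instead of rows. A sketch: wrap the file object with tqdm.wrapattr and hand the wrapper to read_csv (only the relevant arguments are shown; the rest of your read_csv options carry over unchanged):
import os
import pandas as pd
from tqdm import tqdm

file_size = os.path.getsize(filepath)
with tqdm.wrapattr(open(filepath, 'rb'), 'read',
                   total=file_size, desc='Reading csv') as wrapped:
    # read_csv accepts a file-like object, so it reads through the wrapper
    # and the bar advances as bytes are consumed
    df = pd.read_csv(wrapped, sep=sep, skiprows=skip_rows, header=0,
                     encoding='iso-8859-1')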
Since there was a question in the comments about a progress bar for some pandas DataFrame methods, I will note a solution for such cases. The parallelbar library lets you track progress for the popular Pool methods of the multiprocessing module (map, imap and imap_unordered). It is easy to adapt it for parallel work with pandas dataframes (and track progress) as follows:
# pip install parallelbar
from parallelbar import progress_map
import pandas as pd
import numpy as np
from multiprocessing import cpu_count
def parallelize_dataframe(df, func, split_size=cpu_count() * 4, **kwargs):
    df_split = np.array_split(df, split_size)
    result_df = pd.concat(progress_map(func, df_split, **kwargs),
                          ignore_index=True)
    return result_df
Here df is your dataframe; func is the function to be applied to the dataframe; split_size is how many parts df is split into for parallelization (the default is usually a good choice); and **kwargs are optional keyword arguments for the progress_map function (see its documentation).
For example:
def foo(df):
    # 'col' is the name of the column you want to convert
    df[col] = pd.to_datetime(df[col])
    return df

if __name__ == '__main__':
    new_df = parallelize_dataframe(df, foo)
Not only will you see the progress of execution, but the execution of the pd.to_datetime function will also be parallelized, which will significantly speed up your work.
I'm using pandas to write a parquet file using the to_parquet function with partitions. Example:
df.to_parquet('gs://bucket/path', partition_cols=['key'])
The issue is that every time I run the code, it adds a new parquet file in the partition, and when you read the data you get all of the data from every time the script was run. Essentially, the data appends each time.
Is there a way to overwrite the data every time you write using pandas?
I have found Dask to be helpful for reading and writing parquet. It defaults the file name on write (which you can alter) and will replace the parquet file if you use the same name, which I believe is what you are looking for. You can append data to the partition by setting 'append' to True, which is more intuitive to me, or you can set 'overwrite' to True, which removes all files in the partition/folder prior to writing the file. Reading parquet also works well when you include the partition columns in the dataframe on read.
https://docs.dask.org/en/stable/generated/dask.dataframe.to_parquet.html
See below some code I used to satisfy myself of the behaviour of dask.dataframe.to_parquet:
import pandas as pd
from dask import dataframe as dd
import numpy as np
dates = pd.date_range("2015-01-01", "2022-06-30")
df_len = len(dates)
df_1 = pd.DataFrame(np.random.randint(0, 1000, size=(df_len, 1)), columns=["value"])
df_2 = pd.DataFrame(np.random.randint(0, 1000, size=(df_len, 1)), columns=["value"])
df_1["date"] = dates
df_1["YEAR"] = df_1["date"].dt.year
df_1["MONTH"] = df_1["date"].dt.month
df_2["date"] = dates
df_2["YEAR"] = df_2["date"].dt.year
df_2["MONTH"] = df_2["date"].dt.month
ddf_1 = dd.from_pandas(df_1, npartitions=1)
ddf_2 = dd.from_pandas(df_2, npartitions=1)
name_function = lambda x: f"monthly_data_{x}.parquet"
ddf_1.to_parquet(
    "dask_test_folder",
    name_function=name_function,
    partition_on=["YEAR", "MONTH"],
    write_index=False,
)
print(ddf_1.head())
ddf_first_write = dd.read_parquet("dask_test_folder/YEAR=2015/MONTH=1")
print(ddf_first_write.head())
ddf_2.to_parquet(
    "dask_test_folder",
    name_function=name_function,
    partition_on=["YEAR", "MONTH"],
    write_index=False,
)
print(ddf_2.head())
ddf_second_write = dd.read_parquet("dask_test_folder/YEAR=2015/MONTH=1")
print(ddf_second_write.head())
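For completeness, the append/overwrite behaviour mentioned above can also be requested explicitly. A sketch reusing the names from the example (overwrite wipes the target folder before writing, append keeps the existing files and adds to them):
ddf_2.to_parquet(
    "dask_test_folder",
    name_function=name_function,
    partition_on=["YEAR", "MONTH"],
    write_index=False,
    overwrite=True,  # or append=True to add to what is already there
)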
Yeah, there is. If you read the pandas docs you'll see that to_parquet supports **kwargs and uses engine='pyarrow' by default. That takes you to the pyarrow docs, where you'll see there are two ways of doing this. One is partition_filename_cb, which needs legacy support and will be deprecated. The other is basename_template, which is the new way, introduced because of the performance cost of running a callable/lambda to name each partition. You pass a string such as "string_{i}"; it only works with legacy support off, and the saved files will be "string_0", "string_1", and so on.
You can't use both at the same time.
def write_data(
    df: pd.DataFrame,
    path: str,
    file_format="csv",
    comp_zip="snappy",
    index=False,
    partition_cols: list[str] = None,
    basename_template: str = None,
    storage_options: dict = None,
    **kwargs,
) -> None:
    # Look up the matching pd.DataFrame.to_<format> writer and call it
    getattr(pd.DataFrame, f"to_{file_format}")(
        df,
        f"{path}.{file_format}",
        compression=comp_zip,
        index=index,
        partition_cols=partition_cols,
        basename_template=basename_template,
        storage_options=storage_options,
        **kwargs,
    )
Try this.
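For example, a sketch of a direct pandas call (assuming a pyarrow version where the new dataset API is in effect, as described above; the template name is just an illustration, and reusing the same template on every run means the files with those names get replaced):
df.to_parquet(
    "gs://bucket/path",
    partition_cols=["key"],
    basename_template="part-{i}.parquet",
)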
I have a function which does some operations on each DataFrame column and extracts a shorter series from it (in the original code there are some time-consuming calculations going on).
It then adds the series to a dictionary before going on to the next column.
In the end it creates a dataframe from the dictionary and manipulates its index.
How can I parallelize the loop in which each column is manipulated?
This is a less complicated, reproducible sample of the code:
import pandas as pd

raw_df = pd.DataFrame({"A": [1.1] * 100000,
                       "B": [2.2] * 100000,
                       "C": [3.3] * 100000})

def preprocess_columns(raw_df):
    df = {}
    width = 137
    for name in raw_df.columns:
        '''
        Note: the operations in this loop do not have a deep sense and are just
        for illustration of the function preprocess_columns. In the original
        code there are ~ 50 lines of list comprehensions etc.
        '''
        # 3. do some column operations (actually there's more than just this operation)
        seriesF = raw_df[[name]].dropna()
        afterDropping_indices = seriesF.index.copy(deep=True)
        list_ = list(raw_df[name])[width:]
        df[name] = pd.Series(list_.copy(), index=afterDropping_indices[width:])
    # create df from dict and reindex
    df = pd.concat(df, axis=1)
    df = df.reindex(df.index[::-1])
    return df

raw_df = preprocess_columns(raw_df)
Maybe you can use this:
https://github.com/xieqihui/pandas-multiprocess
pip install pandas-multiprocess
from pandas_multiprocess import multi_process
args = {'width': 137}
result = multi_process(func=func, data=df, num_process=8, **args)
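If you'd rather avoid an extra dependency, a minimal sketch with only the standard library could look like this; process_column is a hypothetical stand-in for the per-column work in your loop, and the sample dataframe is the one from the question:
from concurrent.futures import ProcessPoolExecutor
import pandas as pd

def process_column(series, width=137):
    # hypothetical stand-in for the heavy per-column work
    kept_index = series.dropna().index
    return pd.Series(list(series)[width:], index=kept_index[width:])

def preprocess_columns_parallel(raw_df):
    # each column is pickled and sent to a worker process; results come back
    # in order and are reassembled into one dataframe
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(process_column,
                                [raw_df[c] for c in raw_df.columns]))
    df = pd.concat(dict(zip(raw_df.columns, results)), axis=1)
    return df.reindex(df.index[::-1])

if __name__ == '__main__':  # guard needed for process-based pools on Windows/macOS
    raw_df = pd.DataFrame({"A": [1.1] * 100000,
                           "B": [2.2] * 100000,
                           "C": [3.3] * 100000})
    out = preprocess_columns_parallel(raw_df)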
I have a dataframe fulldb_accrep_united that looks like this:
SparkID ... Period
0 913955 ... {"#PeriodName": "2000", "#DateBegin": "2000-01...
1 913955 ... {"#PeriodName": "1999", "#DateBegin": "1999-01...
2 16768 ... {"#PeriodName": "2007", "#DateBegin": "2007-01...
3 16768 ... {"#PeriodName": "2006", "#DateBegin": "2006-01...
4 16768 ... {"#PeriodName": "2005", "#DateBegin": "2005-01...
I need to convert the Period column, which is currently a column of strings, into a column of JSON values. Usually I do it with df.apply(lambda x: json.loads(x)), but this dataframe is too large to process as a whole. I want to use dask, but I seem to be missing something important. I think I don't understand how to use apply in dask, and I can't figure out the solution.
The code
This is how I would do it using Pandas with the whole df in memory:
#%% read df
os.chdir('/opt/data/.../download finance/output')
fulldb_accrep_united = pd.read_csv('fulldb_accrep_first_download_raw_quotes_corrected.csv', index_col = 0, encoding = 'utf-8')
os.chdir('..')
#%% Deleting some freaky symbols from column
condition = fulldb_accrep_united['Period'].str.contains('\\xa0', na = False, regex = False)
fulldb_accrep_united.loc[condition.values, 'Period'] = fulldb_accrep_united.loc[condition.values, 'Period'].str.replace('\\xa0', ' ', regex = False).values
#%% Convert to json
fulldb_accrep_united.loc[fulldb_accrep_united['Period'].notnull(), 'Period'] = fulldb_accrep_united['Period'].dropna().apply(lambda x: json.loads(x))
This is the code where I try to use dask:
#%% load data with dask
os.chdir('/opt/data/.../download finance/output')
fulldb_accrep_united = dd.read_csv('fulldb_accrep_first_download_raw_quotes_corrected.csv', encoding = 'utf-8', blocksize = 16 * 1024 * 1024) #16Mb chunks
os.chdir('..')
#%% setup calculation graph. No work is done here.
def transform_to_json(df):
    condition = df['Period'].str.contains('\\xa0', na = False, regex = False)
    df['Period'] = df['Period'].mask(condition.values, df['Period'][condition.values].str.replace('\\xa0', ' ', regex = False).values)
    condition2 = df['Period'].notnull()
    df['Period'] = df['Period'].mask(condition2.values, df['Period'].dropna().apply(lambda x: json.loads(x)).values)

result = transform_to_json(fulldb_accrep_united)
The last cell here gives an error:
NotImplementedError: Series getitem in only supported for other series objects with matching partition structure
What am I doing wrong? I tried to find similar topics for almost 5 hours, but I think I am missing something important, because I am new to the topic.
Your question was long enough that I didn't read through all of it. My apologies. See https://stackoverflow.com/help/minimal-reproducible-example
However, based on the title, it may be that you want to apply the json.loads function across every element in a dataframe's column
df["column-name"] = df["column-name"].apply(json.loads)
I have a recording of a CAN bus transmission and I want to analyze it now. In the past I used Excel for this, but now I am faced with huge amounts of data (> 10 GB). With pd.read_csv I can load the data wonderfully into a dataframe. However, the hexadecimal numbers are stored as strings of the form "6E" rather than "0x6E". Furthermore, some columns are filled with "None".
I tested it with a for loop and an if check on None; this works, but it takes a very long time.
def load_data(self, file_folder, file_type):
    df_local_list = []
    # Load filenames as strings into the list "all_files"
    full_path = glob.glob(file_folder + "/*." + file_type)
    self.all_files = natsort.natsorted(full_path)
    # Walk through all files and load their content into "df_local_list"
    for file in self.all_files:
        # Read the file content into a local dataframe
        local_df = pd.read_csv(file, names=self.header_list,
                               delim_whitespace=True, skiprows=12,
                               skipfooter=3, header=13,
                               error_bad_lines=False, engine='python')
        # Save the file content without the last two lines --> End-Header
        # self.df_list.append(local_df[:-2])
        df_local_list.append(local_df)
    self.df = pd.concat(df_local_list, axis=0, ignore_index=True)
    self.df['Byte0_int'] = ('0x' + self.df['Byte0']).apply(int, base=0)
I would like to have a fast function which converts selected columns from hex to int, skipping the "None" values.
I had a similar issue and I did this:
self.df['Byte0_int'] = self.df['Byte0_int'].dropna().map(lambda x:int(x, 16))
In short, I remove NaN first, then I convert everything else from hex to int
No need to prefix "0x" since they are processed the same:
>>> int("0x5e",16)
94
>>> int("5e",16)
94
I suggest profiling the timings of your code, because the concat could be costly. You could also have a look at Dask to process many files.
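If you need this for several columns, the same pattern extends naturally; hex_cols below is just an illustrative list of the columns you want to convert:
hex_cols = ['Byte0', 'Byte1', 'Byte2']  # hypothetical column names
for col in hex_cols:
    # rows that were NaN stay NaN in the new column thanks to index alignment
    self.df[col + '_int'] = self.df[col].dropna().map(lambda x: int(x, 16))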
I'm looking to optimize the code below which takes ~5 seconds, which is too slow for a file of only 1000 lines.
I have a large file where each line contains valid JSON, with each JSON looking like the following (the actual data is much larger and nested, so I use this JSON snippet for illustration):
{"location":{"town":"Rome","groupe":"Advanced",
"school":{"SchoolGroupe":"TrowMet", "SchoolName":"VeronM"}},
"id":"145",
"Mother":{"MotherName":"Helen","MotherAge":"46"},"NGlobalNote":2,
"Father":{"FatherName":"Peter","FatherAge":"51"},
"Teacher":["MrCrock","MrDaniel"],"Field":"Marketing",
"season":["summer","spring"]}
I need to parse this file in order to extract only some key-values from every JSON, to obtain the resulting dataframe:
Groupe    Id  MotherName  FatherName
Advanced  56  Laure       James
Middle    11  Ann         Nicolas
Advanced   6  Helen       Franc
But some keys I need in the dataframe are missing in some JSON objects, so I have to verify whether the key is present and, if not, fill the corresponding value with null. I use the following method:
df = pd.DataFrame(columns=['groupe', 'id', 'MotherName', 'FatherName'])
with open(path/to/file) as f:
    for chunk in f:
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName,
                        "FatherName": FatherName},
                       ignore_index=True)
I need to optimize the runtime over the whole 1000-row file to <= 2 seconds. In Perl the same parsing function takes < 1 second, but I need to implement it in Python.
You'll get the best performance if you can build the dataframe in a single step during initialization. DataFrame.from_records takes a sequence of tuples, which you can supply from a generator that reads one record at a time. You can parse the data faster with get, which supplies a default when the item isn't found. I created an empty dict called dummy to pass to intermediate gets so that a chained get always works.
I created a 1000-record dataset, and on my crappy laptop the time went from 18 seconds to 0.06 seconds. That's pretty good.
import numpy as np
import pandas as pd
import json
import time

def extract_data(data):
    """Convert one JSON line to a record tuple for import."""
    dummy = {}
    jfile = json.loads(data.strip())
    return (
        jfile.get('location', dummy).get('groupe', np.nan),
        jfile.get('id', np.nan),
        jfile.get('Mother', dummy).get('MotherName', np.nan),
        jfile.get('Father', dummy).get('FatherName', np.nan))

start = time.time()
df = pd.DataFrame.from_records(map(extract_data, open('file.json')),
                               columns=['groupe', 'id', 'MotherName', 'FatherName'])
print('New algorithm', time.time() - start)

#
# The original way
#
start = time.time()
df = pd.DataFrame(columns=['groupe', 'id', 'MotherName', 'FatherName'])
with open('file.json') as f:
    for chunk in f:
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName,
                        "FatherName": FatherName},
                       ignore_index=True)
print('original', time.time() - start)
The key part is not to append each row to the dataframe inside the loop. Keep the rows in a list or dict container and build the dataframe from it once at the end. You can also simplify your if/else structure with a simple get that returns a default value (e.g. np.nan) if the item is not found in the dictionary.
with open(path/to/file) as f:
    d = {'groupe': [], 'id': [], 'MotherName': [], 'FatherName': []}
    for chunk in f:
        jfile = json.loads(chunk)
        d['groupe'].append(jfile['location'].get('groupe', np.nan))
        d['id'].append(jfile.get('id', np.nan))
        d['MotherName'].append(jfile['Mother'].get('MotherName', np.nan))
        d['FatherName'].append(jfile['Father'].get('FatherName', np.nan))
df = pd.DataFrame(d)
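As another hedged option, you can let pandas do the flattening with json_normalize and just select the columns afterwards; the 'file.json' name and the dotted column names below follow the structure of the sample record, and keys that are missing in some records simply become NaN:
import json
import pandas as pd

records = [json.loads(line) for line in open('file.json')]
flat = pd.json_normalize(records)        # nested dicts become dotted column names
df = flat.reindex(columns=['location.groupe', 'id',
                           'Mother.MotherName', 'Father.FatherName'])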