This is a conceptual question, so there is no code or reproducible example.
I am processing data pulled from a database that contains records from automated processes. A regular record has 14 fields: a unique ID and 13 fields of metrics, such as the date of creation, the time of execution, the customer ID, the type of job, and so on. The database accumulates records at a rate of dozens a day, a couple of thousand per month.
Sometimes the processes fail, which produces malformed rows. Here is an example:
id1,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,m13 /*regular record, no error, 14 fields*/
id2,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,"DELETE error, failed" /*error in column 14*/
id3,m01,m02,"NO SUCH JOB error, failed" /*error in column 4*/
id4,m01,m02,m03,m04,m05,m06,"JOB failed, no time recorded" /*error in column 7*/
The requirements are to (1) populate a dashboard from the metrics and (2) catalog the types of errors. The ideal solution would use read_csv with on_bad_lines set to some function that returns a dataframe. My hacky solution is to munge the data by hand, row by row, and create two dataframes from the output. The presence of bad lines can be reliably detected by the keyword "failed". I have written logic that collects the "failed" messages and produces a stacked bar chart by date. It works, but I'd rather use a pure pandas solution.
Is it possible to use pd.read_csv() to return 2 dataframes? If so, how would this be done? Can you point me to any example code? Or am I totally off base? Thanks.
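A note on on_bad_lines: in recent pandas versions (1.4+, if I remember right) it can indeed be a callable when engine="python", but the callable is only invoked for lines with too many fields; short rows like the error records above are simply padded with NaN, which is why the answers below filter on the "failed" keyword instead. Here is a rough sketch of the callable form, with a hypothetical file name:

import pandas as pd

bad_rows = []

def collect_bad(fields):        # receives the fields of an over-long line as a list of strings
    bad_rows.append(fields)     # stash it for the error catalogue
    return None                 # returning None tells read_csv to skip the line

df_good = pd.read_csv("jobs.csv", header=None, engine="python", on_bad_lines=collect_bad)
df_fail = pd.DataFrame(bad_rows)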
You can load your CSV file into a DataFrame and apply a filter:
import pandas as pd

df = pd.read_csv("your_file.csv", header=None)
# Mark every row whose cells contain the keyword 'failed' anywhere
df_filter = df.apply(lambda row: row.astype(str).str.contains('failed').any(), axis=1)
df[df_filter.values]   # this gives a dataframe of "failed" rows
df[~df_filter.values]  # this gives a dataframe of "non failed" rows
You need to make sure that your keyword does not appear in your legitimate data.
PS: There might be more optimized ways to do this.
This approach reads the entire CSV into a single column, then uses a mask that identifies the failed rows to split out good and failed dataframes.
Read the entire CSV into a single column
import io
import pandas as pd
# sim_csv is the path to your CSV file; the huge field width keeps each line in one column
dfs = pd.read_fwf(sim_csv, widths=[999999], header=None)
Build a mask identifying the failed rows
fail_msk = dfs[0].str.contains('failed')
Use that mask to split out and build separate dataframes
df_good = pd.read_csv(io.StringIO('\n'.join(dfs[~fail_msk].squeeze())), header=None)
df_fail = pd.read_csv(io.StringIO('\n'.join(dfs[fail_msk].squeeze())), header=None)
I work at a place where I use a lot of Python (pandas), and the data keeps getting bigger and bigger: last month I was working with a few hundred thousand rows, weeks after that a few million rows, and now I am working with 42 million rows. Most of my work is to take a dataframe and, for each row, look up its "equivalent" in another dataframe and process the data; sometimes it's just a merge, but more often I need to run a function on the equivalent data. Back in the days of a few hundred thousand rows it was OK to just use apply and a simple filter, but now that is extremely slow. Recently I've switched to vaex, which is way faster than pandas in every aspect except apply, and after some searching I found that apply is a last resort and should be used only if you have no other option. So, is there another option? I really don't know.
Some code to explain how I was doing this entire time:
def get_secondary(row: pd.Series):
    cnae = row["cnae_fiscal"]
    cnpj = row["cnpj"]
    # cnaes is another dataframe
    secondary = cnaes[cnaes.cnpj == cnpj]
    return [cnae] + list(secondary["cnae"].values)

empresas["cnae_secundarios"] = empresas.apply(get_secondary, axis=1)
This isn't the only use case, as I said.
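For the lookup shown above, one vectorized alternative is to pre-aggregate the secondary dataframe once with groupby and then attach the result with a single merge, instead of filtering cnaes once per row. A sketch, assuming the column names from the snippet (cnpj, cnae, cnae_fiscal) and that cnpj is the join key:

import pandas as pd

# Collect all secondary CNAEs per cnpj in one pass
secundarios = (cnaes.groupby("cnpj")["cnae"]
                    .agg(list)
                    .rename("cnae_secundarios")
                    .reset_index())

# One merge instead of one boolean filter per row
empresas = empresas.merge(secundarios, on="cnpj", how="left")

# Rows with no match come back as NaN; replace them with an empty list,
# then prepend the primary CNAE so the output matches the apply version
empresas["cnae_secundarios"] = empresas["cnae_secundarios"].apply(
    lambda v: v if isinstance(v, list) else [])
empresas["cnae_secundarios"] = (empresas["cnae_fiscal"].apply(lambda c: [c])
                                + empresas["cnae_secundarios"])

The only per-row work left is the cheap list handling at the end; the expensive per-row filter on cnaes is gone.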
I have one data frame with 1,782,568 distinct groups.
When I melt that data at the grouping level, my kernel gets stuck.
So I decided to melt the data group by group and then combine the results sequentially.
For that I wrote the following function:
def split(df, key):
    df2 = pd.DataFrame()
    for i in range(df[key].drop_duplicates().shape[0]):
        grp_key = tuple(df[key].drop_duplicates().iloc[i, :])
        df1 = (df.groupby(key, as_index=False)
                 .get_group(grp_key).reset_index().drop('index', axis=1))
        df2 = df2.append(df1.groupby(key, as_index=False)
                            .apply(pd.melt, id_vars=key).reset_index()).dropna()
    df2 = df2.drop(grep('level', df2.columns), axis=1)
    return df2
Here grep is a user-defined function of mine that works like the grep function in R.
In df I pass the data frame, and in key I pass the grouping keys as a list.
But the function still takes a very long time to complete.
Can anyone help me improve the performance?
Thanks in advance.
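As a point of comparison, pd.melt already keeps the id_vars on every output row, so the same reshape can usually be done in a single call without looping over the 1,782,568 groups at all. A sketch, assuming df and key are the same objects you pass to split() above:

import pandas as pd

# Melt the whole frame at once; id_vars=key keeps the grouping columns on every row,
# so grouping beforehand is unnecessary
df2 = df.melt(id_vars=key).dropna()

Row order aside, this should produce the same variable/value pairs as the per-group version, minus the helper 'level' index columns you were dropping with grep.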
I have to read massive CSV files (500 million lines), and I tried to read them with pandas using the chunksize option in order to reduce memory consumption. But I don't understand the behaviour of the concat method, or whether reading the whole file this way actually reduces memory. I'm adding some pseudocode to explain what I've done so far.
Let's say I'm reading and then concatenating a file with n lines:
iter_csv = pd.read_csv(file.csv,chunksize=n/2)
df = pd.concat([chunk for chunk in iter_csv])
Then I have to apply a function to the dataframe to create a new column based on some values:
df['newcl'] = df.apply(function)
Everything goes fine.
But now I wonder what's the difference between the above procedure and the following:
iter_csv = pd.read_csv(file.csv,chunksize=n/2)
for chunk in iter_csv:
chunk['newcl'] = chunk.apply(function)
df = pd.concat([chunk])
In terms of RAM consumption, I thought the second method should be better because it applies the function only to a chunk and not to the whole dataframe. But the following issues occur:
putting df = pd.concat([chunk]) inside the loop gives me a dataframe with a size of n/2 (the size of the last chunk), not the full one;
putting df = pd.concat([chunk]) outside, after the loop, gives the same n/2 dataframe length.
So my doubt is whether the first method (concatenating the dataframe right after read_csv) is the best one, balancing speed and RAM consumption. And I'm also wondering how I can concat the chunks using the for loop.
Thanks for your support.
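On the second question (how to concat the chunks with a for loop): a common pattern is to collect each processed chunk in a list and concatenate once after the loop, so that df ends up with all n rows instead of only the last chunk. A sketch, assuming function works row-wise as in the pseudocode above (and using n // 2 because chunksize must be an integer):

import pandas as pd

chunks = []
for chunk in pd.read_csv("file.csv", chunksize=n // 2):
    chunk["newcl"] = chunk.apply(function, axis=1)   # process one chunk at a time
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)            # the full dataframe, built once

Note that once everything is concatenated the peak memory is similar to the first method; the chunked loop mainly helps if you can aggregate or write each chunk out instead of keeping them all.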
I'd like to use a DataFrame to manage data from many trials of an experiment I'm controlling with Python code. Ideally I will have one master dataframe with a row for each trial that lives in the main function namespace, and then a separate dict (or dataframe) returned from the function that I call to execute the important bits of code for each trial.
What is the best way to do a running update of the master dataframe with this returned set of data? So far I've come up with:
df = df.append(df_trial, ignore_index=True)
or
df = pd.concat([df, df_trial])
But neither seems ideal (and both take a relatively long time according to %timeit). Is there a more Pandonic way?
You should build a list of the pieces and concatenate them all in one shot at the end.
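A sketch of that pattern, with run_trial and n_trials as hypothetical stand-ins for your per-trial function and trial count:

import pandas as pd

results = []                           # one dict (or small DataFrame) per trial
for trial in range(n_trials):
    results.append(run_trial(trial))   # hypothetical per-trial function

df = pd.DataFrame(results)                      # if each result is a dict
# df = pd.concat(results, ignore_index=True)    # if each result is a DataFrame

Appending to a Python list is cheap, while growing a DataFrame with append or concat inside the loop copies the whole master frame on every iteration.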