Using pandas.read_csv() for malformed CSV data - python

This is a conceptual question, so there is no code or reproducible example.
I am processing data pulled from a database that contains records from automated processes. A regular record contains 14 fields: a unique ID plus 13 metric fields, such as the date of creation, the time of execution, the customer ID, the type of job, and so on. The database accumulates records at the rate of dozens a day and a couple of thousand per month.
Sometimes the processes fail, which results in malformed rows. Here is an example:
id1,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,m13 /*regular record, no error, 14 fields*/
id2,m01,m02,m03,m04,m05,m06,m07,m08,m09,m10,m11,m12,"DELETE error, failed" /*error in column 14*/
id3,m01,m02,"NO SUCH JOB error, failed" /*error in column 4*/
id4,m01,m02,m03,m04,m05,m06,"JOB failed, no time recorded" /*error in column 7*/
The requirements are to (1) populate a dashboard from the metrics, and (2) catalog the types of errors. The ideal solution would use read_csv with on_bad_lines set to some function that returns a dataframe. My hacky solution is to munge the data by hand, row by row, and create two DataFrames from the output. The presence of the bad lines can be reliably detected by the keyword "failed". I have written the logic that collects the "failed" messages and produces a stacked bar chart by date. It works, but I'd rather use a pure pandas solution.
Is it possible to use pd.read_csv() to return 2 dataframes? If so, how would this be done? Can you point me to any example code? Or am I totally off base? Thanks.
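For reference, a minimal sketch of the on_bad_lines callback approach mentioned above (available in pandas 1.4+ with the python engine only; the file and column names are assumptions). One caveat: the callback only fires for rows with too many fields, so rows that are merely short, as in the example, are NaN-padded rather than handed to it:
import pandas as pd

bad_rows = []

def collect_bad(fields):        # fields: the offending line, already split on the separator
    bad_rows.append(fields)     # keep it for the error catalog
    return None                 # returning None drops the line from the main parse

cols = ['id'] + ['m%02d' % i for i in range(1, 14)]   # assumed column names
df_good = pd.read_csv('jobs.csv', header=None, names=cols,
                      engine='python', on_bad_lines=collect_bad)
df_fail = pd.DataFrame(bad_rows)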

You can load your CSV file into a DataFrame and apply a filter:
import pandas as pd

df = pd.read_csv("your_file.csv", header=None)
df_filter = df.apply(lambda row: row.astype(str).str.contains('failed').any(), axis=1)
df[df_filter.values]   # this gives a DataFrame of "failed" rows
df[~df_filter.values]  # this gives a DataFrame of "non-failed" rows
You need to make sure that the keyword does not appear in your regular data.
PS: there might be more optimized ways to do it.
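A possible follow-up sketch to the above (the column names are assumptions): name the good columns and pull the failure message out of each bad row, since the message is its last non-null field:
cols = ['id'] + ['m%02d' % i for i in range(1, 14)]     # assumed column names
df_good = df[~df_filter.values].copy()
df_good.columns = cols
df_fail = df[df_filter.values].copy()
df_fail['error'] = df_fail.ffill(axis=1).iloc[:, -1]    # last non-null field per row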

This approach reads the entire CSV into a single column, then uses a mask that identifies the failed rows to split out and create good and failed DataFrames.
Read the entire CSV into a single column:
import io
import pandas as pd

# sim_csv is the path (or buffer) of the raw CSV file
dfs = pd.read_fwf(sim_csv, widths=[999999], header=None)
Build a mask identifying the failed rows:
fail_msk = dfs[0].str.contains('failed')
Use that mask to split out and build separate DataFrames:
df_good = pd.read_csv(io.StringIO('\n'.join(dfs[~fail_msk].squeeze())), header=None)
df_fail = pd.read_csv(io.StringIO('\n'.join(dfs[fail_msk].squeeze())), header=None)

Related

Applying corrections to a subsampled copy of a dataframe back to the original dataframe?

I'm a Pandas newbie, so please bear with me.
Overview: I started with a free-form text file created by a data harvesting script that remotely accessed dozens of different kinds of devices, and multiple instances of each. I used OpenRefine (a truly wonderful tool) to munge that into a CSV that was then input to dataframe df using Pandas in a JupyterLab notebook.
My first inspection of the data showed the 'Timestamp' column was not monotonic. I accessed individual data sources as follows, in this case for the 'T-meter' data source. (The technique was taken from a search result - I don't really understand it, but it worked.)
cond = df['Source']=='T-meter'
rows = df.loc[cond, :]
df_tmeter = pd.DataFrame(columns=df.columns)
df_tmeter = df_tmeter.append(rows, ignore_index=True)
then checked each as follows:
df_tmeter['Timestamp'].is_monotonic
Fortunately, the problem was easy to identify and fix: Some sensors were resetting, then sending bad (but still monotonic) timestamps until their clocks were updated. I wrote the function healing() to cleanly patch such errors, and it worked a treat:
df_tmeter['healed'] = df_tmeter['Timestamp'].apply(healing)
Now for my questions:
How do I get the 'healed' values back into the original df['Timestamp'] column for only the 'T-meter' items in df['Source']?
Given the function healing(), is there a clean way to do this directly on df?
Thanks!
Edit: I first thought I should be using 'views' into df, but other operations on the data would either generate errors, or silently turn the views into copies.
I wrote a wrapper function heal_row() for healing():
def heal_row(row):
    if row['Source'] == 'T-meter':  # Redundant check, but safe!
        row['Timestamp'] = healing(row['Timestamp'])
    return row
then did the following:
df = df.apply(lambda row: row if row['Source'] != 'T-meter' else heal_row(row), axis=1)
This ordering is important, since healing() is stateful based on the prior row(s), and thus can't be the default operation.
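For the record, a minimal sketch of doing the same thing directly on df with .loc (the T-meter subset keeps its original row order, which matters because healing() is stateful):
cond = df['Source'] == 'T-meter'
df.loc[cond, 'Timestamp'] = df.loc[cond, 'Timestamp'].apply(healing)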

Get subset of rows where any column contains a particular value

I have a very large data file (foo.sas7bdat) that I would like to filter rows from without loading the whole data file into memory. For example, I can print the first 20 rows of the dataset without loading the entire file into memory by doing the following:
import pandas
import itertools
with pandas.read_sas('foo.sas7bdat') as f:
    for row in itertools.islice(f, 20):
        print(row)
However, I am unclear on how to only print (or preferably place in a new file) only rows that have any column that contain the number 123.1. How can I do this?
Pandas has the ability to pull dataframes one chunk at a time. Following the trail of read_sas() documentation to "chunksize" I came across this:
http://pandas.pydata.org/pandas-docs/stable/io.html#iterating-through-files-chunk-by-chunk
for chunk in pd.read_sas('foo.sas7bdat', iterator=True, chunksize=100000):
    print(chunk)
This would get chunks of 100,000 lines.
As for the filtering problem, you would need a query; however, I don't know the constraints of the problem. If you build a DataFrame with all the columns you might still overflow your memory, so a more efficient way would be to collect the matching indexes in a set, sort them, and then use .iloc to pull those entries into a DataFrame if you want one.
You may need to use tools that take this into account. Dask is a good alternative for use on clusters.
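A minimal sketch of the chunked filtering idea above (the value 123.1 comes from the question; the output file name is an assumption):
import pandas as pd

matches = []
for chunk in pd.read_sas('foo.sas7bdat', chunksize=100000):
    mask = (chunk == 123.1).any(axis=1)   # rows where any column equals 123.1
    matches.append(chunk[mask])

result = pd.concat(matches, ignore_index=True)
result.to_csv('foo_filtered.csv', index=False)   # hypothetical output file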

HDFStore get column names

I have some problems with pandas' HDFStore being far too slow, and unfortunately I'm unable to put together a satisfying solution from other questions here.
Situation
I have a big DataFrame, containing mostly float and occasionally integer columns, which goes through multiple processing steps (renaming, removing bad entries, aggregating by 30 min). Each row has a timestamp associated with it. I would like to save some intermediate steps to an HDF file, so that the user can do a single step iteratively without starting from scratch each time.
Additionally, the user should be able to plot certain columns from these saves in order to select bad data. Therefore I would like to retrieve only the column names without reading the data in the HDFStore.
Concretely, the user should get a list of all columns of all DataFrames stored in the HDF file, then select which columns they would like to see, after which I use matplotlib to present them the corresponding data.
Data
shape == (5730000, 339) does not seem large at all, that's why I'm confused... (it might get far more rows over time, but the columns should stay fixed)
In the first step I iteratively append rows and columns (that runs okay), but once that's done I always process the entire DataFrame at once, only grouping or removing data.
My approach
I do all manipulations in memory, since pandas seems to be rather fast and I/O is slower (the HDF file is on a different physical server, I think)
I use a datetime index and automatically selected float or integer columns
I save the steps with hdf.put('/name', df, format='fixed') since hdf.put('/name'.format(grp), df, format='table', data_columns=True) seemed to be far too slow.
I use e.g. df.groupby(df.index).first() and df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict) to process the data, where agg_dict is a dictionary with one function per column. This is incredibly slow as well.
For plotting, I have to read in the entire DataFrame and then get the columns: hdfstore.get('/name').columns
Question
How can I retrieve all column names without reading any data from the HDFStore?
What would be the most efficient way of storing my data? Is HDF the right option? Table or fixed format?
Does it matter in terms of efficiency whether the index is a datetime index? Does a more efficient format exist in general (e.g. all columns the same, fixed dtype)?
Is there a faster way to aggregate than groupby (df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict))?
similar questions
How to access single columns using .select
I see that I can use this to retrieve only certain columns but only after I know the column names, I think.
Thank you for any advice!
You can simply load 0 rows of the DataFrame by specifying the same start and stop values, and leave all the internal index/column processing to pandas itself:
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([('A', 'B'), range(2)], names=('Alpha', 'Int'))
df = pd.DataFrame(np.random.randn(len(idx), 3), index=idx, columns=('I', 'II', 'III'))
df
>>>                   I        II       III
>>> Alpha Int
>>> A     0   -0.472412  0.436486  0.354592
>>>       1   -0.095776 -0.598585 -0.847514
>>> B     0    0.107897  1.236039 -0.196927
>>>       1   -0.154014  0.821511  0.092220
The following works for both fixed and table formats:
with pd.HDFStore('test.h5') as store:
    store.put('df', df, format='f')
    meta = store.select('df', start=1, stop=1)

meta
meta.index
meta.columns
>>>                I  II  III
>>> Alpha Int
>>>
>>> MultiIndex(levels=[[], []],
>>>            codes=[[], []],
>>>            names=['Alpha', 'Int'])
>>>
>>> Index(['I', 'II', 'III'], dtype='object')
As for the other questions:
As long as your data is mostly homogeneous (almost all float columns, as you mentioned) and you are able to store it in a single file without needing to distribute data across machines, HDF is the first thing to try.
If you need to append/delete/query data, you must use the table format. If you only need to write once and read many times, the fixed format will improve performance.
As for the datetime index, I think the same idea applies as in point 1: if you are able to convert all data into a single type, it should increase your performance.
Nothing else comes to mind beyond what was proposed in the comments to your question.
For an HDFStore hdf and a key (from hdf.keys()), you can get the column names with:
# Table stored with hdf.put(..., format='table')
columns = hdf.get_node('{}/table'.format(key)).description._v_names
# Table stored with hdf.put(..., format='fixed')
columns = list(hdf.get_node('{}/axis0'.format(key)).read().astype(str))
Note that hdf.get(key).columns works as well, but it reads all the data into memory, whereas the approach above only reads the column names.
Full working example:
#!/usr/bin/env python
import pandas as pd

data = pd.DataFrame({'a': [1, 1, 1, 2, 3, 4, 5], 'b': [2, 3, 4, 1, 3, 2, 1]})

with pd.HDFStore(path='store.h5', mode='a') as hdf:
    hdf.put('/DATA/fixed_store', data, format='fixed')
    hdf.put('/DATA/table_store', data, format='table', data_columns=True)
    for key in hdf.keys():
        try:
            # column names of a table store
            print(hdf.get_node('{}/table'.format(key)).description._v_names)
        except AttributeError:
            try:
                # column names of a fixed store
                print(list(hdf.get_node('{}/axis0'.format(key)).read().astype(str)))
            except AttributeError:
                # e.g. a dataset created by h5py instead of pandas
                print('unknown node in HDF.')
Columns without reading any data:
store.get_storer('df').ncols # substitute 'df' with your key
# you can also access nrows and other useful fields
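A short usage sketch of that storer object (assuming a table-format key 'df' in a file named store.h5):
import pandas as pd

with pd.HDFStore('store.h5') as store:
    storer = store.get_storer('df')
    print(storer.ncols, storer.nrows)   # counts only; no data is read into memory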
From the docs (fixed format, table format):
[fixed] These types of stores are not appendable once written (though you can simply remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety. They also do not support dataframes with non-unique column names. The fixed format stores offer very fast writing and slightly faster reading than table stores.
[table] Conceptually a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions. In addition, delete and query type operations are supported.
You may try to use epoch milliseconds (or nanoseconds since the epoch) in place of datetimes. This way, you are just dealing with integer indices.
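A minimal sketch of that conversion, assuming df has a DatetimeIndex:
df.index = df.index.asi8              # int64 nanoseconds since the epoch
# df.index = df.index.asi8 // 10**6   # use this line for milliseconds instead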
You may have a look at this answer if what you need is a groupby on large data.
A piece of advice: if you have four questions to ask, it may be better to ask them as four separate questions on SO. That way you'll get more (and higher-quality) answers, since each one is easier to tackle, and each will deal with a specific topic, making it easier for people searching for specific answers to find them.

Selecting Pandas DataFrame Rows Based On Conditions

I am new to Python and getting to grips with pandas. I am trying to perform a simple import-CSV, filter, write-CSV workflow, but the filter seems to be dropping rows of data compared to my Access query.
I am importing via the command below:
Costs1516 = pd.read_csv('C:......../1b Data MFF adjusted.csv')
Following the import I get a warning that the service code column contains data of multiple types (some are numerical codes, others are purely text), but the import seems to assign the dtype object, which I thought would just treat them both as strings and all would be fine...
I want the output DataFrame to have the same structure as the imported data (Costs1516), but only to include rows where 'Service Code' = '110'.
I have pulled the following SQL from Access which seems to do the job well, and returns 136k rows:
SELECT [1b Data MFF adjusted].*, [1b Data MFF adjusted].[Service code]
FROM [1b Data MFF adjusted]
WHERE ((([1b Data MFF adjusted].[Service code])="110"));
My pandas equivalent is below but only returns 99k records:
Costs1516Ortho = Costs1516.loc[Costs1516['Service code'] == '110']
I have compared the two outputs and I can't see any reason why pandas is excluding some lines and including others... I'm really stuck; any suggested areas to look at or approaches to test would be gratefully received.
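One area to check: the mixed-type warning suggests that some codes may have been parsed as the integer 110, which == '110' will not match. A minimal sketch that forces the column to strings (column and file names taken from the question, path shortened):
import pandas as pd

Costs1516 = pd.read_csv('1b Data MFF adjusted.csv', dtype={'Service code': str})
mask = Costs1516['Service code'].str.strip() == '110'
Costs1516Ortho = Costs1516.loc[mask]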

Pandas Combine Dataframe Options for More than one instance of Key?

I am using Python 3.4, and Windows 7. Here is a sample of my first Dataframe: Sample Data
Here is my second DataFrame: Sample Data 2
My goal is to use the "RTID" column as my key. However, as evidenced by the data that I have parsed from another data structure, there appear to be duplicate keys. Moreover, the requirement necessitates that each RTID has a unique transaction type.
I have many more of these DataFrames (some of which also share common column header names) that need to be combined into one cohesive DataFrame. Each row value's integrity must be maintained with its header. Duplicate column names should appear only once in the final product, with the respective values appended sequentially row by row (hence my initial thought of using the RTID column as a key), and missing or non-applicable values should be left empty. My initial thought was to concatenate, but, due to the various dtypes, I receive the following error:
AssertionError: invalid dtype determination in get_concat_dtype
This can be sourced here: Pandas/Internals.py
@EdChum and @BrianPendleton were very helpful with the memory management issue.
I am wondering if join and merge could be valid use cases for this specific context. I welcome feedback on this.
I am referencing pg. 188 of Python for Data Analysis for my answer. After reviewing the various methods offered, I was able to achieve the end product.
Citing the above two sample data sources (and dropping the indexes):
import pandas as pd

sample1 = pd.read_csv('sample_data.csv', dtype=str, error_bad_lines=False)
sample2 = pd.read_csv('sample2.csv', dtype=str, error_bad_lines=False)
sample_concat = pd.concat([sample1, sample2], keys=['one', 'two'], ignore_index=True)
This produced the correct output. It turns out that I was overthinking the problem. In this context the row index is not meaningful, and ignore_index=True tells concat not to preserve the indexes along the concatenation axis. This is useful, as I was not seeking to find the intersection of the data sets (which, in theory, should not be present in the data structure I am wrangling).
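A tiny illustration (with made-up frames) of how concat aligns shared column names and leaves gaps as NaN:
import pandas as pd

a = pd.DataFrame({'RTID': [1, 2], 'Type': ['buy', 'sell']})
b = pd.DataFrame({'RTID': [3], 'Amount': [9.5]})

combined = pd.concat([a, b], ignore_index=True)
#    RTID  Type  Amount
# 0     1   buy     NaN
# 1     2  sell     NaN
# 2     3   NaN     9.5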
