vaex error during head of a large data frame - python

I am trying to use vaex as an alternative to pandas to merge extremely big data frames (100k rows + 176M rows) on a string column.
The .join seems to work without any error, and I can even check the .shape of the resulting data frame, but when I try to .head the result, a big error stack is returned (attached below).
One of the lines near the end mentions pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays.
My first guess would be that I don't have enough RAM, but the merge went surprisingly okay. How can I fix this?
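For reference, a minimal sketch of the workflow described above, assuming the two frames live in files vaex can open lazily (the file names and the key column 'key' are placeholders):
import vaex

# Open both frames lazily; vaex memory-maps the files instead of loading them into RAM.
small = vaex.open('small_100k_rows.hdf5')   # placeholder file name
large = vaex.open('large_176m_rows.hdf5')   # placeholder file name

# Join on a string key column (placeholder name 'key').
joined = large.join(small, on='key', how='left')

print(joined.shape)    # works: the join is lazy, nothing has been materialized yet
print(joined.head(5))  # materializing rows is where the ArrowInvalid error surfaces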

Related

Summarize Pandas dataframe column

I have a large dataframe (~4.7M rows) where one of the columns contains document text. I am trying unsuccessfully to run Gensim summarize on a specific column for the entire dataframe.
df['summary'] = df['variable_content'].apply(lambda x: summarize(x, word_count=200))
Extracting each row of variable_content into a variable and running summarize works well, but is slow and ugly. I also get the error:
ValueError: input must have more than one sentence
but can't find a row with only one sentence (most have hundreds or thousands). Can anyone help?
You have 4.7 million rows, each of which has hundreds or thousands of sentences, and you expect this to work in finite time? That's what I call "optimism". I would suggest looping through your dataframe and running your thing in chunks of, say, 1000 rows, saving the work as you go along and printing out the number of the chunk as you go. Once it fails, you will know roughly where the failure was, plus you will actually get some results.
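A rough sketch of that chunked approach, assuming the df, column name, and summarize call from the question (the chunk size, output file names, and gensim import path are placeholders for whatever version you have installed):
from gensim.summarization import summarize  # gensim < 4.0; adjust to your install

chunk_size = 1000  # placeholder; tune to taste
for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size].copy()
    try:
        chunk['summary'] = chunk['variable_content'].apply(
            lambda x: summarize(x, word_count=200))
    except ValueError as exc:
        # The "input must have more than one sentence" row is somewhere in this chunk.
        print('chunk starting at row %d failed: %s' % (start, exc))
        continue
    chunk.to_csv('summaries_%d.csv' % start)  # save work as you go
    print('finished chunk starting at row %d' % start)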

Get subset of rows where any column contains a particular value

I have a very large data file (foo.sas7bdat) from which I would like to filter rows without loading the whole file into memory. For example, I can print the first 20 rows of the dataset without loading the entire file into memory by doing the following:
import pandas
import itertools
with pandas.read_sas('foo.sas7bdat') as f:
    for row in itertools.islice(f, 20):
        print(row)
However, I am unclear on how to print (or preferably place in a new file) only the rows that have any column containing the number 123.1. How can I do this?
Pandas has the ability to pull dataframes one chunk at a time. Following the trail of read_sas() documentation to "chunksize" I came across this:
http://pandas.pydata.org/pandas-docs/stable/io.html#iterating-through-files-chunk-by-chunk
for chunk in pd.read_sas('foo.sas7bdat', iterator=True, chunksize=100000):
    print(chunk)
This would get chunks of 100,000 lines.
As for the filtering itself, you need a query, though I don't know the constraints of the problem. If you build a DataFrame with all the columns you might still overflow your memory, so an efficient way would be to collect the matching indexes into a set, sort them, and use .iloc to pull just those entries if you want to put them into a DataFrame.
You may need to use tools that take this into account. Dask is a good alternative for use on clusters.
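A minimal sketch of that chunked filter, assuming the target value 123.1 can be matched with a simple equality test against every column (chunk size and output path are placeholders):
import pandas as pd

matches = []
for chunk in pd.read_sas('foo.sas7bdat', chunksize=100000):
    # Keep rows where any column equals 123.1.
    mask = (chunk == 123.1).any(axis=1)
    matches.append(chunk[mask])

result = pd.concat(matches)
result.to_csv('rows_containing_123_1.csv', index=False)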

Pandas - Appending 'table' format to HDF5Store with different dtypes: invalid combinate of [values_axes]

I recently started trying to use the HDF5 format in python pandas to store data, but encountered a problem I can't find a workaround for. Before this I worked with CSV files and had no trouble appending new data.
This is what I try:
store = pd.HDFStore('cdw.h5')
frame.to_hdf('cdw.h5','cdw/data_cleaned', format='table',append=True, data_columns=True,dropna=False)
And it throws:
ValueError: invalid combinate of [values_axes] on appending data [name->Ordereingangsdatum,cname->Ordereingangsdatum,dtype->float64,kind->float,shape->(1, 176345)] vs current table [name->Ordereingangsdatum,cname->Ordereingangsdatum,dtype->bytes128,kind->string,shape->None]
I get that it is telling me I am trying to append a different data type for a column, but what baffles me is that I have written the same CSV file before, along with some other CSV files from a DataFrame, to that HDF5 file.
I'm doing analysis in the forwarding industry and the data there is very inconsistent - more often than not there are missing values, mixed dtypes in columns, or other 'data dirt'.
I'm looking for a way to append data to an HDF5 file no matter what is inside the column, as long as the column names are the same.
It would be beautiful to enforce appending data to the HDF store independent of data types, or to find another simple solution to my problem. The goal is to automate the analysis later on, so I'd rather not change data types every time I have a missing value in one of the 62 columns.
Another question within this question:
My file access with read_hdf takes more time than read_csv; I have around 1.5 million rows with 62 columns. Is this because I have no SSD drive? I have read that file access with read_hdf should be faster.
I wonder whether I should rather stick with CSV files or with HDF5.
Help would be greatly appreciated.
Okay, for anyone having the same issue with appending data where the dtype is not guaranteed to stay the same: I finally found a solution. First convert every column to object with
li = list(frame)
frame[li] = frame[li].astype(object)
Check with frame.info(), then try df.to_hdf(path, key, append=True) and wait for its error message. The error TypeError: Cannot serialize the column [not_one_datatype] because its data contents are [mixed] object dtype names the columns it still doesn't like. Converting those columns to float worked for me: df['not_one_datatype'].astype(float). Only use integer if you are sure a float will never occur in this column, otherwise the append will break again.
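A minimal sketch of that trial-and-error loop, assuming the frame and store path from the question ('not_one_datatype' is just the placeholder column name the error message would report):
import pandas as pd

# Force every column to object first, as described above.
li = list(frame)
frame[li] = frame[li].astype(object)

try:
    frame.to_hdf('cdw.h5', 'cdw/data_cleaned', format='table',
                 append=True, data_columns=True, dropna=False)
except TypeError as exc:
    print(exc)  # names the mixed-dtype column, e.g. 'not_one_datatype'
    # Cast the offending column explicitly, then retry the append.
    frame['not_one_datatype'] = frame['not_one_datatype'].astype(float)
    frame.to_hdf('cdw.h5', 'cdw/data_cleaned', format='table',
                 append=True, data_columns=True, dropna=False)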
I decided to work with CSV and HDF5 files in parallel. If I hit a problem with HDF5 that I have no workaround for, I will simply switch back to CSV - that is my personal recommendation.
Update: it seems the creators of this format did not have reality in mind when designing the HDF API: the error HDF5 min_itemsize error: ValueError: Trying to store a string with len [##] in [y] column but this column has a limit of [##]! occurs when appending data to an already existing file if some string in a column happens to be longer than those in the initial write to the HDF file.
The joke here is that the creators of this API expect me to know the maximum length of every possible value in a column at the first write. Really? Another inconsistency is that df.to_hdf(append=True) does not take the parameter min_itemsize={'column1':1000}. This format is at best suited for storing self-created data, but definitely not for data where the dtypes and the lengths of the entries in each column are not set in stone. The only solution left, if you want to append data from pandas DataFrames independently of the stubborn HDF5 API in Python, is to insert into every DataFrame, before appending, a row with very long strings (except for the numeric columns), just to be sure that you will always be able to append the data no matter how long it gets.
Doing this makes the write process take ages and eats up gigantic amounts of disk space for the resulting HDF5 file.
CSV definitely wins against HDF5 in terms of performance, integration, and especially usability.
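For what it's worth, the store-level append does accept a min_itemsize argument, which may be a less wasteful way to reserve string width up front than padding a dummy row; a sketch under that assumption (the column name and width are placeholders):
import pandas as pd

store = pd.HDFStore('cdw.h5')
# Reserve up to 1000 characters for the string column so later, longer values still fit.
store.append('cdw/data_cleaned', frame, format='table', data_columns=True,
             min_itemsize={'column1': 1000})
store.close()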

How to perform a rolling_median on a large data frame in Pandas without encountering the skiplist_insert failed error?

I have a huge data frame with about 1041507 rows.
I wanted to calculate a rolling_median for certain values, under certain categories in my data frame. For this I used a groupby followed by apply:
df['rolling_median']=df['value'].groupby(['Category','Subcategory']).apply(pd.rolling_median,7,min_periods=7)
However, this gives me MemoryError: skiplist_insert failed. I will attach the full traceback if needed, but I came across another similar post which specifies that this is an issue in pandas for very large group sizes (>~ 35,000), as can be seen here: https://github.com/pydata/pandas/issues/11697
After this I tried a bit of manipulation to simply get the rolling median by iterating over each group separately:
for index, group in df.groupby(['Category', 'Subcategory']):
    print pd.rolling_median(group['value'], 7, min_periods=7)
Each group is only about 20-25 rows long, yet this fails with the same MemoryError after a few iterations.
I ran the code several times, and every time it showed the memory error for different items.
I created some dummy values for anyone to test, here:
import numpy as np
import pandas as pd

index = []
[index.append(x) for y in range(25) for x in np.arange(34000)]
sample = pd.DataFrame(np.arange(34000 * 25), index=index)

for index, group in sample.groupby(level=0):
    try:
        pd.rolling_median(group[0], 7, 7)
    except MemoryError:
        print index  # note which group raised the error
        print pd.rolling_median(group[0], 7, 7)
If I run rolling_median again after catching the MemoryError (as you can see in the above code), it runs fine without another exception.
I am not sure how I can calculate my rolling_median if it keeps throwing this MemoryError.
Can anyone tell me a better way to calculate the rolling_median, or help me understand the issue here?
The groupby doesn't look right and should change
df['rolling_median']=df['value'].groupby(['Category','Subcategory']).apply(pd.rolling_median,7,min_periods=7)
to
df['rolling_median']=df.groupby(['Category','Subcategory'])['value'].apply(pd.rolling_median,7,min_periods=7)
Otherwise the groupby won't work: df['value'] is a Series, so it can't be grouped by ['Category','Subcategory'] because those columns are not present.
Also, the groupby result is going to be smaller than the length of the dataframe, and creating df['rolling_median'] will cause a length mismatch.
Hope that helps.
The bug has been fixed in pandas 0.18.0, and the methods rolling_mean() and rolling_median() have now been deprecated.
This was the bug: https://github.com/pydata/pandas/issues/11697
Can be viewed here: http://pandas.pydata.org/pandas-docs/stable/computation.html
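For completeness, a sketch of the same calculation written against the post-0.18 rolling API (this assumes the column and group names from the question; transform keeps the result aligned with the original index):
df['rolling_median'] = (
    df.groupby(['Category', 'Subcategory'])['value']
      .transform(lambda s: s.rolling(7, min_periods=7).median())
)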

Is .loc the best way to build a pandas DataFrame?

I have a fairly large csv file (700mb) which is assembled as follows:
qCode Date Value
A_EVENTS 11/17/2014 202901
A_EVENTS 11/4/2014 801
A_EVENTS 11/3/2014 2.02E+14
A_EVENTS 10/17/2014 203901
etc.
I am parsing this file to get specific values, and then using DF.loc to populate a pre-existing DataFrame, i.e. the code:
for line in fileParse:
    code = line[0]
    for point in fields:
        if point == code[code.find('_') + 1:len(code)]:
            date = line[1]
            year, quarter = quarter_map(date)
            value = float(line[2])
            pos = line[0].find('_')
            ticker = line[0][0:pos]
            i = ticker + str(int(float(year))) + str(int(float(quarter)))
            df.loc[i, point] = value
        else:
            pass
The question I have is: is .loc the most efficient way to add values to an existing DataFrame? This operation seems to take over 10 hours...
FYI, fields are the columns in the DataFrame (the values I'm interested in) and the index (i) is a string...
thanks
No, you should never build a dataframe row-by-row. Each time you do this the entire dataframe has to be copied (it's not extended in place), so you are using n + (n - 1) + (n - 2) + ... + 1, i.e. O(n^2), memory (which has to be garbage collected)... which is terrible, hence it's taking hours!
You want to use read_csv, and you have a few options:
read in the entire file in one go (this should be fine with 700mb even with just a few gig of ram).
pd.read_csv('your_file.csv')
read in the csv in chunks and then glue them together (in memory)... tbh I don't think this will actually use less memory than the above, but is often useful if you are doing some munging at this step.
pd.concat(pd.read_csv('foo.csv', chunksize=100000))  # not sure what optimum value is for chunksize
read the csv in chunks and save them into pytables (rather than in memory), if you have more data than memory (and you've already bought more memory) then use pytables/hdf5!
store = pd.HDFStore('store.h5')
for df in pd.read_csv('foo.csv', chunksize=100000):
    store.append('df', df)
If I understand correctly, I think it would be much faster to:
Import the whole csv into a dataframe using pandas.read_csv.
Select the rows of interest from the dataframe.
Append the rows to your other dataframe using df.append(other_df).
If you provide more information about what criteria you are using in step 2 I can provide code there as well.
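A rough sketch of those three steps, assuming the qCode column from the sample above and a placeholder filter condition (pd.concat stands in for the df.append step described):
import pandas as pd

# 1. Read the whole file in one go.
new_df = pd.read_csv('your_file.csv')  # placeholder file name

# 2. Select the rows of interest; the condition here is only an example.
rows_of_interest = new_df[new_df['qCode'].str.endswith('_EVENTS')]

# 3. Combine with the pre-existing frame.
df = pd.concat([df, rows_of_interest])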
A couple of options that come to mind
1) Parse the file as you are currently doing, but build a dict instead of appending to your dataframe. After you're done with that, convert the dict to a DataFrame and then use concat() to combine it with the existing DataFrame (a sketch follows below).
2) Bring the csv into pandas using read_csv(), filter/parse it down to what you want, then concat() the result with the existing dataframe.
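A minimal sketch of option 1, reusing the names from the question (fileParse, fields, quarter_map, df); values are collected in a plain dict and converted to a DataFrame once at the end:
import pandas as pd

rows = {}  # maps the string index i -> {column: value}
for line in fileParse:
    code = line[0]
    point = code[code.find('_') + 1:]
    if point not in fields:
        continue
    year, quarter = quarter_map(line[1])
    ticker = code[:code.find('_')]
    i = ticker + str(int(float(year))) + str(int(float(quarter)))
    rows.setdefault(i, {})[point] = float(line[2])

# One DataFrame construction and one concat instead of millions of .loc writes.
new_part = pd.DataFrame.from_dict(rows, orient='index')
df = pd.concat([df, new_part])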
