What would be the fastest way to convert Redis Stream output (aioredis client / hiredis parser) to a Pandas DataFrame, where the Redis Stream ID's timestamp and sequence number, as well as the values, are type-converted properly and used as Pandas index columns?
Example Redis output:
[[b'1554900384437-0', [b'key', b'1']],
[b'1554900414434-0', [b'key', b'1']]]
There seem to be two main bottlenecks here:
1. Pandas DataFrames store their data in column-major format, meaning each column maps to one NumPy array, whereas the Redis stream data arrives row by row.
2. Pandas MultiIndex is made for categorical data, and converting raw arrays into the required levels/codes structure does not appear to be optimized.
Because of point 1, it is inevitable to loop over all Redis stream entries. Assuming we know the length beforehand, we can pre-allocate NumPy arrays that we fill as we go along and, with some tricks, reuse these arrays as the DataFrame columns. If the overhead of looping in Python is still too much, rewriting the loop in Cython should be straightforward.
Since you didn't specify data types, the answer keeps everything as bytes in object-dtype NumPy arrays; it should be reasonably obvious how to adapt this to a custom setting. The only reason to put all of the columns into the same array is to move an inner loop over the columns/fields from Python into C. It can be split up into, e.g., one array per data type or one array per column.
from functools import partial, reduce
import numpy as np
import pandas as pd

data = [[b'1554900384437-0', [b'foo', b'1', b'bar', b'2', b'bla', b'abc']],
        [b'1554900414434-0', [b'foo', b'3', b'bar', b'4', b'bla', b'xyz']]]

colnames = data[0][1][0::2]          # field names, taken from the first entry
ncols = len(colnames)
nrows = len(data)

# Pre-allocate one array for the ID parts and one for all field values.
ts_seq = np.empty((2, nrows), dtype=np.int64)
cols = np.empty((ncols, nrows), dtype=object)

for i, (entry_id, fields) in enumerate(data):
    ts, seq = entry_id.split(b"-", 1)
    ts_seq[:, i] = (int(ts), int(seq))
    cols[:, i] = fields[1::2]        # values only; field names are assumed constant

# Wrap each row of `cols` as a single-column DataFrame and merge them
# index-on-index with copy=False so the underlying arrays are reused.
colframes = [pd.DataFrame(cols[i:i+1, :].T) for i in range(ncols)]
merge = partial(pd.merge, left_index=True, right_index=True, copy=False)
df = reduce(merge, colframes[1:], colframes[0])
df.columns = colnames
For point 2, we can use numpy.unique to create the levels/codes structure needed by a Pandas MultiIndex. From the documentation it seems that numpy.unique also sorts the data. Since our data is presumably already sorted, a possible future optimisation would be to skip the sorting step (a sketch follows after the code below).
ts = ts_seq[0, :]
seq = ts_seq[1, :]
maxseq = np.max(seq)
ts_levels, ts_codes = np.unique(ts, return_inverse=True)
seq_levels = np.arange(maxseq+1)
seq_codes = seq
df.index = pd.MultiIndex(levels=[ts_levels, seq_levels], codes=[ts_codes, seq_codes], names=["Timestamp", "Seq"])
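As a rough, unbenchmarked sketch of the sorting-skip idea mentioned above (assuming the timestamps in ts are already in ascending order, as Redis stream IDs within one stream are):
# True at positions where a new timestamp value starts
new_level = np.concatenate(([True], ts[1:] != ts[:-1]))
ts_levels = ts[new_level]             # unique timestamps, in order of appearance
ts_codes = np.cumsum(new_level) - 1   # index of each row's timestamp in ts_levels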
Finally, we can verify that there was no copying involved by doing
cols[0, 0] = b'79'
and checking that the entries in df do indeed change.
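For instance, a minimal check (whether the array is actually shared depends on the pandas version; with copy-on-write enabled in newer pandas, a copy may be made after all):
cols[0, 0] = b'79'
print(df.iloc[0, 0])  # shows b'79' if the DataFrame still shares the underlying array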
The quickest way is to process the data in batches:
1. Do the IO in batches of N messages (e.g. 100 messages per batch).
2. Convert each batch into one DataFrame (using pd.DataFrame).
3. Apply a lambda or conversion function to the timestamp column, pulled out as a NumPy array via .values, along these lines:
# stream IDs look like b'<ms-timestamp>-<seq>'; convert the millisecond part
df['time'] = [datetime.fromtimestamp(int(t.split(b'-')[0]) / 1000) for t in df['time'].values]
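A minimal sketch of steps 1-2, assuming batch is one fetched chunk in the [[id, [field, value, ...]], ...] shape shown in the question (the field names and layout here are placeholders):
import pandas as pd

# One batch of stream entries, e.g. the result of an XRANGE ... COUNT 100 call
batch = [[b'1554900384437-0', [b'key', b'1']],
         [b'1554900414434-0', [b'key', b'1']]]

rows = []
for entry_id, fields in batch:
    # fields alternate between names and values
    row = {fields[i].decode(): fields[i + 1] for i in range(0, len(fields), 2)}
    row['time'] = entry_id  # raw stream ID, converted afterwards as shown above
    rows.append(row)

df = pd.DataFrame(rows)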
You can use this:
pd.read_msgpack(redisConn.get("key"))
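Note that pd.read_msgpack was deprecated in pandas 0.25 and removed in 1.0. A similar round trip through Redis can be sketched with pickle instead (the key name and connection details are placeholders):
import pickle

import pandas as pd
import redis

r = redis.Redis()                       # placeholder connection
df = pd.DataFrame({'key': [1, 1]})

r.set("key", pickle.dumps(df))          # store the frame as bytes
df_back = pickle.loads(r.get("key"))    # read it back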
Related
I am working on a project that uses a pandas DataFrame, and I received some values into its columns as shown below.
I need to add the pos_vec column and the word_vec column elementwise and create a new column called sum_of_arrays. The arrays in that third column should also be of size 2.
Eg:
pos_vec                       word_vec                  sum_of_arrays
[-0.22683072, 0.32770252]     [0.3655883, 0.2535131]    [0.13875758, 0.58121562]
Is there anyone who can help me? I'm stuck here. :(
If you convert them to np.array you can simply sum them.
import pandas as pd
import numpy as np
df = pd.DataFrame({'pos_vec':[[-0.22683072,0.32770252],[0.14382899,0.049593687],[-0.24300802,-0.0908088],[-0.2507714,-0.18816864],[0.32294357,0.4486494]],
'word_vec':[[0.3655883,0.2535131],[0.33788466,0.038143277], [-0.047320127,0.28842866],[0.14382899,0.049593687],[-0.24300802,-0.0908088]]})
If you want to use numpy
df['col_sum'] = df[['pos_vec','word_vec']].applymap(lambda x: np.array(x)).sum(1)
If you don't want to use numpy
df['col_sum'] = df.apply(lambda row: [a + b for a, b in zip(row.pos_vec, row.word_vec)], axis=1)
There may be cleaner approaches using pandas to iterate over the columns; however, this is the solution I came up with by extracting the data from the DataFrame as lists:
# Extract data as lists
pos_vec = df["pos_vec"].tolist()
word_vec = df["word_vec"].tolist()
# Create new list with desired calculation
sum_of_arrays = [[x + y for x, y in zip(l1, l2)] for l1, l2 in zip(pos_vec, word_vec)]
# Add new list to DataFrame
df["sum_of_arrays"] = sum_of_arrays
I have a fairly large pandas DataFrame df. I also have a pandas Series of scale factors, factors.
I want to scale df by every scale factor in factors and concatenate the resulting DataFrames into one larger DataFrame. Since this large DataFrame will not fit into memory, I thought it might be good to use a dask DataFrame for this, but I don't know how to approach the problem.
Below is what I want to achieve, written with pandas DataFrames; dflarge in the actual case will not fit in memory.
import random
import pandas as pd
df = pd.DataFrame({
    'id1': range(1, 6),
    'a': [random.random() for i in range(5)],
    'b': [random.random() for i in range(5)],
})
df = df.set_index('id1')
factors = [random.random() for i in range(10)]
dflist = []
for i, factor in enumerate(factors):
    scaled = df * factor
    scaled['id2'] = i
    dflist.append(scaled)
dflarge = pd.concat(dflist)
dflarge = dflarge.reset_index().set_index(['id1', 'id2'])
I would like to make the scaling and concatenating as efficient as possible since there will be tens of thousands of scale factors. I'd like to run it distributed if possible.
I really appreciate any kind of help you can give.
Just delay it!
dask.dataframe and dask.delayed are what you need here, and running it with dask.distributed should work fine. Assuming that df is still a pandas.DataFrame, turn the loop into a function that you can call in a list comprehension using dask.delayed. I've made some small changes to your code below:
import random
import pandas as pd
import dask.dataframe as dd
from dask import delayed
df = pd.DataFrame({
    'id1': range(1, 6),
    'a': [random.random() for i in range(5)],
    'b': [random.random() for i in range(5)],
})
df = df.set_index('id1')
factors = [random.random() for i in range(10)]
dflist = []

def scale_my_df(df_init, scale_factor, id_num):
    '''
    Scales and returns a DataFrame.
    '''
    df_scaled = df_init * scale_factor
    df_scaled['id2'] = id_num
    return df_scaled

dfs_delayed = [delayed(scale_my_df)(df_init=df, scale_factor=factor, id_num=i)
               for i, factor in enumerate(factors)]
ddf = dd.from_delayed(dfs_delayed)
And now you have a dask.DataFrame built from your scaled pandas.DataFrames. Two things of note:
Dask is lazy, so as of the end of this code snippet nothing has been computed. A computational graph has been set up with the operations required to create the DataFrame you want. In this example with small DataFrames, you could execute:
ddf_large = ddf.compute()
And you will have the same pandas.DataFrame as dflarge in your code above, assuming the factors are the same. Almost...
As of this writing dask does not appear to support multi-level indices, so your .set_index(['id1', 'id2']) code will not work. This has been raised in issue #1493 and there are some workarounds if you really need a multi-level index.
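One common workaround (my own suggestion, not taken from issue #1493) is to pack both keys into a single composite index column, roughly like this:
# Sketch: emulate a two-level index with one combined key column.
ddf2 = ddf.reset_index()
ddf2['id1_id2'] = ddf2['id1'].astype(str) + '_' + ddf2['id2'].astype(str)
ddf2 = ddf2.set_index('id1_id2')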
EDIT:
If the original data df is really large, as in already maxing out your memory, converting it to a .csv or other pandas-readable format and building that into the scale function might be necessary, i.e.:
def scale_my_df(df_filepath, scale_factor, id_num):
    '''
    Scales and returns a DataFrame.
    '''
    df_init = pd.read_csv(df_filepath)
    df_scaled = df_init * scale_factor
    df_scaled['id2'] = id_num
    return df_scaled
And adjust the rest of the code accordingly. The idea of dask is to keep the data out of memory, but there is some overhead involved with building the computational graph and holding intermediate values.
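As a usage sketch of that point: if the concatenated result itself will not fit in memory, it can be written out partition by partition instead of being materialised with .compute() (the output path is a placeholder):
# Write the lazily built frame to disk one partition at a time,
# so the full result never has to exist in memory at once.
ddf.to_parquet("scaled_output/", write_index=True)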
I want to divide a dataframe into two based on a given number of records:
train = corpus.iloc[:, :10000]
test = corpus.iloc[:, 10000:]
This is the code that I am using.
I am getting the error below:
AttributeError: iloc not found
Is iloc not part of Python 3? Is there any other method to split the data based on the number of records?
Edit
As mentioned by the user @craig, iloc is a pandas feature, and the data type I have is a sparse matrix (scipy.sparse.csr.csr_matrix).
No need for iloc; you can use a row slice directly:
Pandas
import pandas as pd
df = pd.DataFrame(range(10))
df_first_half = df[:5]
df_second_half = df[5:]
Scipy
import numpy as np
from scipy.sparse import csr_matrix
x = csr_matrix((10, 3), dtype=np.int8)
x_first_half = x[:5].toarray()
x_second_half = x[5:].toarray()
If you're unfamiliar with the [5:] notation, see: https://scipy-cookbook.readthedocs.io/items/Indexing.html. Briefly, it's a one-dimensional slice (rows). Multi-dimensional slicing, e.g. [5:, :1], is also available.
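Since the original code slices columns (corpus.iloc[:, :10000]) rather than rows, the equivalent on the sparse matrix is a column slice; a small sketch (the 10000 split point is taken from the question, the matrix here is a stand-in):
import numpy as np
from scipy.sparse import csr_matrix

corpus = csr_matrix((5, 20000), dtype=np.int8)  # stand-in for the real corpus
train = corpus[:, :10000]   # first 10000 columns
test = corpus[:, 10000:]    # remaining columns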
The following operation
import pandas as pd
import numpy as np
data = pd.read_csv(fname,sep=",",quotechar='"')
will create a 650,000 x 9 dataframe. The first column contains dates, and the following function is designed to take a single date stamp and turn it into 5 separate features.
import time

def timepartition(elm):
    tm = time.strptime(elm, "%Y-%m-%d %H:%M:%S")
    return tm[0], tm[1], tm[2], tm[3], tm[4]

data["Dates"].map(timepartition)
What I would like is to assign those 5 values to a 650,000 x 7 NumPy array.
xtrn = np.zeros(shape=(data.shape[0],7))
xtrn[:,0:4] = np.asarray(data["Dates"].map(timepartition))
#above returns error ValueError: could not broadcast input array from shape (650000) into shape (650000,4)
You might try using some of the built-in pandas features.
dates = pd.to_datetime(data['Dates'])
date_df = pd.DataFrame(dict(
    year=dates.dt.year,
    month=dates.dt.month,
    day=dates.dt.day,
    # etc. -- hour and minute to get all 5 features
))
xtrn[:, :5] = date_df.values  # use date_df[['year', 'month', 'day', ...]] if the order comes out wrong
The map function applied to a DataFrame column maps to a new Series object, and because the function returns tuples, it comes back as an object Series.
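In other words, the mapped result is a one-dimensional array of tuples rather than a two-dimensional array, which is why the broadcast fails. One way to expand it (a sketch, not necessarily the fastest) is:
mapped = data["Dates"].map(timepartition)   # object Series of 5-tuples
xtrn[:, 0:5] = np.array(mapped.tolist())    # expand into an (n, 5) block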
Another approach is the following. Make this change to timepartition:
def timepartition(elm):
    tm = time.strptime(elm, "%Y-%m-%d %H:%M:%S")
    return [tm[i] for i in range(5)]
This will now return a list instead of a tuple. The following code creates a matrix with the desired dimensions from the DataFrame series and maps it into xtrn.
xtrn[:, 0:5] = np.matrix(list(map(timepartition, data["Dates"].tolist())))
np.matrix infers the matrix shape from the nested lists produced by applying the partitioning function to a flat list representation of the series.
The following worked for me. I'm not sure which method is faster, but it was easier for me to understand logically what's going on. Here my dataset "crimes" is your "data" and our time formats are a bit different.
def timepartition(elm):
    tm = time.strptime(elm, "%m/%d/%Y %H:%M:%S %p")
    return tm[0:5]

zeros = np.zeros(shape=(crimes.shape[0], 3), dtype=int)
dates = np.array([timepartition(crimes["Date"][i]) for i in range(0, len(crimes))])
new = np.hstack((dates, zeros))
I am trying to pass values to stats.friedmanchisquare from a dataframe df, that has shape (11,17).
This is what works for me (only for three rows in this example):
df = df.as_matrix()
print stats.friedmanchisquare(df[1, :], df[2, :], df[3, :])
which yields
(16.714285714285694, 0.00023471398805908193)
However, the line of code is too long when I want to use all 11 rows of df.
First, I tried to pass the values in the following manner:
df = df.as_matrix()
print stats.friedmanchisquare([df[x, :] for x in np.arange(df.shape[0])])
but I get:
ValueError:
Less than 3 levels. Friedman test not appropriate.
Second, I also tried not converting it to matrix form and leaving it as a DataFrame (which would be ideal for me), but I guess this is not supported yet, or I am doing it wrong:
print stats.friedmanchisquare([row for index, row in df.iterrows()])
which also gives me the error:
ValueError:
Less than 3 levels. Friedman test not appropriate.
So, my question is: what is the correct way of passing parameters to stats.friedmanchisquare based on df? (or even using its df.as_matrix() representation)
You can download my dataframe in csv format here and read it using:
df = pd.read_csv('df.csv', header=0, index_col=0)
Thank you for your help :)
Solution:
Based on @Ami Tavory's and @vicg's answers (please vote on them), the solution to my problem, based on the matrix representation of the data, is to use the * operator (defined here, but better explained here), as follows:
df = df.as_matrix()
print stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
And the same is true if you want to work with the original dataframe, which is what I ideally wanted:
print stats.friedmanchisquare(*[row for index, row in df.iterrows()])
In this manner you iterate over the dataframe in its native format.
Note that I went ahead and ran some timeit tests to see which way is faster, and as it turns out, converting to a NumPy array beforehand is about twice as fast as using df in its original DataFrame format.
This was my experimental setup:
import timeit
setup = '''
import pandas as pd
import scipy.stats as stats
import numpy as np
df = pd.read_csv('df.csv', header=0, index_col=0)
'''
theCommand = '''
df = np.array(df)
stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
'''
print min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000))
theCommand = '''
stats.friedmanchisquare(*[row for index, row in df.iterrows()])
'''
print min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000))
which yields the following results:
4.97029900551
8.7627799511
The problem I see with your first attempt is that you end up passing a single list with multiple arrays inside it.
stats.friedmanchisquare needs multiple array_like arguments, not one list.
Try using the * (star/unpack) operator to unpack the list, like this:
df = df.as_matrix()
print stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
You could pass it using the "star operator", similarly to this:
from scipy.stats import friedmanchisquare
a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
friedmanchisquare(*(a[i, :] for i in range(a.shape[0])))