I have a DataFrame which I save to/read from a CSV file, and I want to create a term density matrix DataFrame from it. Following herrfz's suggestion here, I use CountVectorizer from sklearn. I wrapped that code in a function:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import coo_matrix, csc_matrix, hstack

countvec = CountVectorizer()

def df2tdm(df, titleColumn, placementColumn):
    '''
    Takes in a DataFrame with at least two columns, and returns a DataFrame with the term density matrix
    of the words appearing in the titleColumn.
    Inputs: df, a DataFrame containing titleColumn, placementColumn among other columns
    Outputs: tdm_df, a DataFrame containing placementColumn and columns with all the words appearing in df.titleColumn
    Credits:
    https://stackoverflow.com/questions/22205845/efficient-way-to-create-term-density-matrix-from-pandas-dataframe
    '''
    tdm_df = pd.DataFrame(countvec.fit_transform(df[titleColumn]).toarray(), columns=countvec.get_feature_names())
    tdm_df = tdm_df.join(pd.DataFrame(df[placementColumn]))
    return tdm_df
This returns the TDM as a DataFrame, for example:
df = pd.DataFrame({'title':['Delicious boiled egg','Fried egg ', 'Potato salad', 'Split orange','Something else'], 'page':[1, 1, 2, 3, 4]})
print(df.head())
tdm_df = df2tdm(df,'title','page')
tdm_df.head()
boiled delicious egg else fried orange potato salad something \
0 1 1 1 0 0 0 0 0 0
1 0 0 1 0 1 0 0 0 0
2 0 0 0 0 0 0 1 1 0
3 0 0 0 0 0 1 0 0 0
4 0 0 0 1 0 0 0 0 1
split page
0 0 1
1 0 1
2 0 2
3 1 3
4 0 4
This implementation suffers from poor memory scaling: when I use a DataFrame which occupies 190 kB saved as UTF-8, the function uses ~200 MB to create the TDM DataFrame. When the CSV file is 600 kB, the function uses 700 MB, and when the CSV is 3.8 MB, the function uses up all of my memory and swap file (8 GB) and crashes.
I also made an implementation using sparse matrices and sparse DataFrames (below), but the memory usage is pretty much the same, only it is considerably slower:
def df2tdm_sparse(df, titleColumn, placementColumn):
    '''
    Takes in a DataFrame with at least two columns, and returns a DataFrame with the term density matrix
    of the words appearing in the titleColumn. This implementation uses sparse DataFrames.
    Inputs: df, a DataFrame containing titleColumn, placementColumn among other columns
    Outputs: tdm_df, a DataFrame containing placementColumn and columns with all the words appearing in df.titleColumn
    Credits:
    https://stackoverflow.com/questions/22205845/efficient-way-to-create-term-density-matrix-from-pandas-dataframe
    https://stackoverflow.com/questions/17818783/populate-a-pandas-sparsedataframe-from-a-scipy-sparse-matrix
    https://stackoverflow.com/questions/6844998/is-there-an-efficient-way-of-concatenating-scipy-sparse-matrices
    '''
    pm = df[[placementColumn]].values
    tm = countvec.fit_transform(df[titleColumn])  # .toarray()
    m = csc_matrix(hstack([pm, tm]))
    dfout = pd.SparseDataFrame([pd.SparseSeries(m[i].toarray().ravel()) for i in np.arange(m.shape[0])])
    dfout.columns = [placementColumn] + countvec.get_feature_names()
    return dfout
Any suggestions on how to improve memory usage? I wonder if this is related to the memory issues of scikit-learn, e.g. here.
I also think that the problem might be with the conversion from sparse matrix to sparse data frame.
Try this function (or something similar):
def SparseMatrixToSparseDF(xSparseMatrix):
    import numpy as np
    import pandas as pd

    def ElementsToNA(x):
        # cast to float and replace explicit zeros with NaN so the sparse series stores only non-zero entries
        x = x.astype(float)
        x[x == 0] = np.nan
        return x

    xdf1 = pd.SparseDataFrame([pd.SparseSeries(ElementsToNA(xSparseMatrix[i].toarray().ravel()))
                               for i in np.arange(xSparseMatrix.shape[0])])
    return xdf1
You can check that it reduces the size by looking at the density of the result:
df1.density
I hope it helps
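On newer pandas versions, where SparseDataFrame and SparseSeries are deprecated, a minimal sketch of the same idea (assuming the question's setup; get_feature_names_out requires a recent scikit-learn, older versions use get_feature_names) is to build the frame directly from the scipy sparse matrix so it is never densified:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def df2tdm_from_spmatrix(df, titleColumn, placementColumn):
    # keep the CountVectorizer output sparse end to end instead of calling .toarray()
    countvec = CountVectorizer()
    tm = countvec.fit_transform(df[titleColumn])  # scipy CSR matrix
    tdm_df = pd.DataFrame.sparse.from_spmatrix(tm, columns=countvec.get_feature_names_out())
    tdm_df[placementColumn] = df[placementColumn].to_numpy()  # attach the placement column as a dense column
    return tdm_df

This keeps memory roughly proportional to the number of non-zero entries rather than rows times vocabulary size.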
Related
I have a dataset of users and locations they are affiliated with, binary encoded as columns.
I'd like to visualize each user on a 2-D axis based on the similarity of their affiliated locations. The closer they are in the vector space, the more similar they are in terms of the locations they're affiliated with.
Here is an example of what I'd like to create...where each dot represents a user and they are closer or further based on their location profile.
I am trying to think through methods to collapse the location information (many dimensions) into 2 dimensions.
The ask:
Are there any techniques that are well suited for this problem?
A few ideas so far:
1) PCA (or similar): Conduct dimensionality reduction via PCA, with an eye for techniques that work with binary features (looking into Kernel PCA; a rough sketch of this appears after the sample data below).
2) Neural Network Embedding: Apply techniques similar to word embeddings to this problem. Create an embedded layer where each user is translated into a continuous vector space (which can be reduced down to 2 dimensions).
Reproducible data is below. The actual dataset is ~5k users and 50 locations, but I'd like the solution to be scalable.
import names
import pandas as pd
import numpy as np

names_list = []
for i in range(1, 100):
    single_name = names.get_full_name()
    names_list.append(single_name)

df = pd.DataFrame(names_list, columns=['Names'])
df['Var1'] = np.random.randint(0, 2, size=len(df))
df['Var2'] = np.random.randint(0, 2, size=len(df))
df['Var100'] = np.random.randint(0, 2, size=len(df))
print(df)
#sample data
Names Var1 Var2 Var100
0 Clayton Stocks 1 1 1
1 Gary Beavers 0 0 1
2 Kristal Feagin 0 1 1
3 Crystal Barb 0 0 1
4 William Wilburn 1 0 0
.. ... ... ... ...
94 Jennifer Cool 0 0 0
95 Roberta Larsen 0 0 0
96 Malcom Mosley 1 0 1
97 Hazel Wilkins 1 1 0
98 Chanell Jaremka 1 0 1
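As a rough sketch of idea 1 above (not a full answer, and assuming the sample df just shown): run PCA on the binary location columns and scatter the first two components; KernelPCA or TruncatedSVD would be drop-in replacements.

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# every column except the name is a binary feature for that user
features = df.drop(columns=['Names']).to_numpy()

# project each user onto 2 components so they become points in the plane
coords = PCA(n_components=2).fit_transform(features)

plt.scatter(coords[:, 0], coords[:, 1])
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.show()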
I want to count the number of common elements between rows (each row has 6 elements/columns).
My dataframe (df) looks something like this:
>>> df
Customer Number Most Frequent Called 1 Most Frequent Called 2 Most Frequent Called 3 Most Frequent Called 4 Most Frequent Called 5
0 552711620 161359852 611336215 884140437 804548991 135953430
1 561712520 186359312 666336115 855140357 899548041 134953530
2 331112180 316659812 436926115 545220357 117748041 984213530
3 873212120 196357673 331112180 565777359 174348053 554212940
4 113219540 733352993 975632166 569117345 175888077 364212923
...
I have tried this code:
connection_df = pd.DataFrame()
for i in range(len(df)):
    connection_list = []
    for j in range(len(df)):
        intersection = set(df.iloc[i]).intersection(df.iloc[j])
        connection_list.append(len(intersection))
    connection_df.insert(loc=i, column=str(i), value=connection_list)
This will give me a dataframe of a form of a matrix like this:
>>> connection_df
0 1 2 3 4
0 6 0 0 0 0
1 0 6 0 0 0
2 0 0 6 1 0
3 0 0 1 6 0
4 0 0 0 0 6
This piece of code does what I want, but since I'm using loops it is very inefficient. There will potentially be millions of rows, so I'd appreciate any suggestions on optimizing this code. Thanks.
An efficient solution consists in performing all the operations with NumPy (by converting the whole dataframe to a NumPy array), computing only the upper part of the matrix since the intersection between two sets is symmetric, and pre-computing all the sets.
import numpy as np
import pandas as pd

def fastConnectionDf(df):
    size = len(df)
    connection_mat = np.zeros((size, size), dtype=int)
    df_mat = df.to_numpy()
    uniqueSets = [np.unique(df_mat[i]) for i in range(size)]  # Precompute all the sets
    for i in range(size):
        connection_mat[i, i] = len(uniqueSets[i])
        for j in range(i + 1, size):
            intersection = np.intersect1d(uniqueSets[i], uniqueSets[j], assume_unique=True)
            connection_mat[i, j] = len(intersection)
    connection_mat = np.maximum(connection_mat, connection_mat.T)
    connection_df = pd.DataFrame(connection_mat)
    return connection_df
On my machine, this solution is 28 times faster on the example dataframe (and up to 50 times faster on bigger dataframes).
Note that it is possible to improve the algorithm by:
just counting the number of intersecting elements rather than creating an array with all the items
using a more clever implementation that sorts the arrays beforehand to speed up the set intersections (see np.searchsorted)
using Numba to speed up the computation on big dataframes
The first two improvements are hard (impossible?) to perform efficiently with NumPy alone, but they are possible with Numba, although this is a bit complex to do; a rough sketch follows below.
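Here is a minimal Numba sketch of that direction (the helper names are made up, and it assumes, as in the example, that the values within each row are already distinct): sort each row once, then count intersections with a linear merge inside a compiled kernel.

import numba as nb
import numpy as np
import pandas as pd

@nb.njit
def _count_intersections(sorted_rows):
    n, m = sorted_rows.shape
    out = np.zeros((n, n), dtype=np.int64)
    for i in range(n):
        out[i, i] = m  # assumes all values within a row are distinct
        for j in range(i + 1, n):
            a = sorted_rows[i]
            b = sorted_rows[j]
            ia = 0
            ib = 0
            count = 0
            while ia < m and ib < m:
                if a[ia] == b[ib]:
                    count += 1
                    ia += 1
                    ib += 1
                elif a[ia] < b[ib]:
                    ia += 1
                else:
                    ib += 1
            out[i, j] = count
            out[j, i] = count
    return out

def fastConnectionDfNumba(df):
    # sort each row once so intersections reduce to a linear merge of two sorted arrays
    mat = np.sort(df.to_numpy(), axis=1)
    return pd.DataFrame(_count_intersections(mat))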
I have a pandas DataFrame which is a 50x50 correlation matrix. In the following picture you can see what I have as an example.
What I would like to do, if it's possible of course, is to make a new DataFrame which has only the elements of the old one that are higher than 0.5 or lower than -0.5, indicating a strong linear relationship, but not 1, to exclude the self-correlations on the diagonal.
I don't think what I'm asking is exactly possible, because of course variable x0 won't have the same strong relationships that x1 has, etc., so the new DataFrame won't look very good.
But is there any way to scan quickly through this DataFrame, find the values I mentioned and maybe at least put them into an array?
Any insight would be helpful. Thanks
You can't really keep the square correlation matrix if you want to drop the correlation pairs that are too weak. One thing you could do is stack the frame and keep the relevant correlation pairs.
Starting with (randomly generated as an example):
0 1 2 3 4
0 0.038142 -0.881054 -0.718265 -0.037968 -0.587288
1 0.587694 -0.135326 -0.529463 -0.508112 -0.160751
2 -0.528640 -0.434885 -0.679416 -0.455866 0.077580
3 0.158409 0.827085 0.018871 -0.478428 0.129545
4 0.825489 -0.000416 0.682744 0.794137 0.694887
you could do:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.uniform(-1, 1, (5, 5)))
df = df.stack()
df = df[((df > 0.5) | (df < -0.5)) & (df != 1)]
print(df)
0 1 -0.881054
2 -0.718265
4 -0.587288
1 0 0.587694
2 -0.529463
3 -0.508112
2 0 -0.528640
2 -0.679416
3 1 0.827085
4 0 0.825489
2 0.682744
3 0.794137
4 0.694887
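To get the surviving pairs "into an array", as the question asks, a small follow-up sketch using the stacked and filtered df from above:

# tidy table of (variable, variable, correlation) triples
pairs = df.reset_index()
pairs.columns = ['var1', 'var2', 'corr']

# or just the correlation values as a NumPy array
values = df.to_numpy()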
I'm trying to sample from a normal distribution using means and standard deviations that are stored in pandas DataFrames.
For example:
import numpy
import pandas

means = pandas.DataFrame(numpy.arange(10).reshape(5, 2))
produces:
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
and:
sts = pandas.DataFrame(numpy.arange(10, 20).reshape(5, 2))
produces:
0 1
0 10 11
1 12 13
2 14 15
3 16 17
4 18 19
How would I produce another pandas DataFrame with the same shape, but with values sampled from the normal distribution using the corresponding means and standard deviations?
i.e. position 0,0 of this new dataframe would sample from a normal distribution with mean=0 and standard deviation=10, and so on.
My function so far:
def make_distributions(self):
    num_data_points, num_species = self.means.shape
    samples = []
    for i, j in zip(self.means, self.stds):
        for k, l in zip(self.means[i], self.stds[j]):
            samples.append(numpy.random.normal(k, l, self.n))
will sample from the distributions for me, but I'm having difficulty putting the data back into a DataFrame with the same shape as the means and standard deviation DataFrames. Does anybody have any suggestions as to how to do this?
Thanks in advance.
You can use numpy.random.normal to sample from a normal distribution.
IIUC, then this might be easiest, taking advantage of broadcasting:
import numpy as np
np.random.seed(1) # only for demonstration
np.random.normal(means,sts)
array([[ 16.24345364, -5.72932055],
[ -4.33806103, -10.94859209],
[ 16.11570681, -29.52308045],
[ 33.91698823, -5.94051732],
[ 13.74270373, 4.26196287]])
Check that it works:
np.random.seed(1)
print(np.random.normal(0, 10))
print(np.random.normal(1, 11))
16.2434536366
-5.72932055015
If you need a pandas DataFrame:
import pandas as pd
pd.DataFrame(np.random.normal(means,sts))
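If means and sts are themselves DataFrames and you want the result to keep their index and column labels (an assumption beyond the snippet above), a minimal variation:

samples = pd.DataFrame(np.random.normal(means.to_numpy(), sts.to_numpy()),
                       index=means.index, columns=means.columns)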
I will use a dictionary to construct this DataFrame. Suppose the indices and columns are the same for means and stds:
means = numpy.arange(10)
means = pd.DataFrame(means.reshape(5, 2))
stds = numpy.arange(10, 20)
stds = pd.DataFrame(stds.reshape(5, 2))

samples = {}
for i in means.columns:
    col = {}
    for j in means.index:
        col[j] = numpy.random.normal(means.loc[j, i], stds.loc[j, i], 2)
    samples[i] = col
print(pd.DataFrame(samples))
# 0 1
#0 [0.0760974520154, 3.29439282825] [11.1292510583, 0.318246201796]
#1 [-25.4518020981, 19.2176263823] [17.0826945017, 9.36179435872]
#2 [14.5402484325, 8.33808246538] [6.96459947914, 26.5552235093]
#3 [0.775891790613, -2.09168601369] [2.38723023677, 15.8099942902]
#4 [-0.828518484847, 45.4592922652] [26.8088977308, 16.0818556353]
Or reset the dtype of a DataFrame and reassign values:
import itertools

samples = means * 0
samples = samples.astype(object)
for i, j in itertools.product(means.index, means.columns):
    samples.at[i, j] = numpy.random.normal(means.loc[i, j], stds.loc[i, j], 2)
I am iterating through the rows of a pandas DataFrame, expanding each one out into N rows with additional info on each one (for simplicity I've made it a random number here):
from pandas import DataFrame
import pandas as pd
from numpy import random, arange

N = 3
x = DataFrame.from_dict({'farm': ['A', 'B', 'A', 'B'],
                         'fruit': ['apple', 'apple', 'pear', 'pear']})
out = DataFrame()
for i, row in x.iterrows():
    rows = pd.concat([row]*N).reset_index(drop=True)  # requires row to be a DataFrame
    out = out.append(rows.join(DataFrame({'iter': arange(N), 'value': random.uniform(size=N)})))
In this loop, row is a Series object, so the call to pd.concat doesn't work. How do I convert it to a DataFrame? (E.g. the difference between x.ix[0:0] and x.ix[0].)
Thanks!
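For the narrow question of turning the Series that iterrows yields into a one-row DataFrame, a minimal sketch (to_frame and iloc are plain pandas; the names row, x, i, N come from the snippet above):

row_df = row.to_frame().T   # transpose the Series into a one-row DataFrame
row_df = x.iloc[[i]]        # or slice with a list so the result stays two-dimensional
rows = pd.concat([row.to_frame().T] * N).reset_index(drop=True)  # drop-in fix for the loop above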
Given what you commented, I would try:
def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

results = x.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
This should give you a separate results DataFrame. I have assumed that every farm-fruit combination is unique... there might be other ways if we knew more about your data.
Update
Running code example
def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

N = 3
df = pd.DataFrame(arange(0, 8).reshape(4, 2), columns=['low', 'high'])
df['farm'] = 'a'
df['fruit'] = arange(0, 4)
results = df.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
df
low high farm fruit
0 0 1 a 0
1 2 3 a 1
2 4 5 a 2
3 6 7 a 3
results
farm fruit
a 0 [0.176124290969, 0.459726835079, 0.999564934689]
1 [2.42920143009, 2.37484506501, 2.41474002256]
2 [4.78918572452, 4.25916442343, 4.77440617104]
3 [6.53831891152, 6.23242754976, 6.75141668088]
If instead you want a DataFrame, you can update the function to:
def giveMeSomeRows(group):
    return pd.DataFrame(random.uniform(low=group.low, high=group.high, size=N))
results
0
farm fruit
a 0 0 0.281088
1 0.020348
2 0.986269
1 0 2.642676
1 2.194996
2 2.650600
2 0 4.545718
1 4.486054
2 4.027336
3 0 6.550892
1 6.363941
2 6.702316
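As a side note, if the goal is simply to repeat every row N times and attach an iteration counter plus a random value, as in the question's loop, a fully vectorized sketch (no iterrows) using the x and N defined in the question:

import numpy as np

out = x.loc[x.index.repeat(N)].reset_index(drop=True)  # repeat every row N times
out['iter'] = np.tile(np.arange(N), len(x))            # 0..N-1 within each original row
out['value'] = np.random.uniform(size=len(out))        # one random value per expanded row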