I am new to python and data science, and I am currently working on a project that is based on a very large dataframe, with 75 columns. I am doing some data exploration and I would like to check for possible correlations between the columns. For smaller dataframes I know I could use pandas plotting.scatter_matrix() on the dataframe in order to do so. However, in my case this produces a 75x75 matrix -- and I can't even visualize the individual plots.
An alternative would be creating lists of 5 columns and using scatter_matrix multiple times, but this method would produce too many scatter matrices. For instance, with 15 columns this would be:
import pandas as pd

df = pd.read_csv('dataset.csv')

list1 = df.columns[0:5]
list2 = df.columns[5:10]
list3 = df.columns[10:15]

pd.plotting.scatter_matrix(df[list1])
pd.plotting.scatter_matrix(df[list2])
pd.plotting.scatter_matrix(df[list3])
In order to use this same method with 75 columns, I'd have to go on until list15. This looks very inefficient. I wonder if there would be a better way to explore correlations in my dataset.
The problem here is, to a lesser extent, the technical part. Producing the plots (5,625 of them) will take quite a long time, and they will also take up a fair amount of memory.
So I would ask a few questions to get around the problems:
Is it really necessary to have all these scatter plots?
Can I reduce the dimensionality in advance?
Why do I have such a high number of dimensions?
If the plots are really useful, you could produce them yourself and stitch them together, or wait until the function finishes.
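If the goal is to explore correlations rather than to look at every pairwise scatter plot, one common alternative is to compute the correlation matrix once and visualize it as a heatmap; a 75x75 grid of correlation values is much easier to scan than 5,625 scatter plots. A minimal sketch, assuming matplotlib is available and 'dataset.csv' is the file from the question (the column names in the last comment are purely illustrative):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('dataset.csv')

# 75x75 matrix of pairwise correlations between the numeric columns
corr = df.corr()

fig, ax = plt.subplots(figsize=(12, 10))
im = ax.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(im, ax=ax)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90, fontsize=6)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns, fontsize=6)
plt.tight_layout()
plt.show()

# strongly correlated pairs can then be inspected individually, e.g.
# pd.plotting.scatter_matrix(df[['colA', 'colB']])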
I am using zero shot classification to label large amounts of data. I have written a simple function to assist me with this and am wondering if there is a better way for this to run. My current logic was to take the highest score and label and append this label into a dataframe.
from tqdm import tqdm

def labeler(input_df, output_df):
    labels = ['Fruit', 'Vegetable', 'Meat', 'Other']
    for i in tqdm(range(len(input_df))):
        # classifier is a zero-shot classification pipeline
        temp = classifier(input_df['description'][i], labels)
        output = {'work_order_num': input_df['order_num'][i],
                  'work_order_desc': input_df['description'][i],
                  'label': temp['labels'][0],
                  'score': temp['scores'][0]}
        output_df.append(output)
In terms of speed and resources, would it be better to rewrite this function using a lambda?
Your problem boils down to iteration over the pandas dataframe input_df. Doing that with a for loop is not the most efficient way (see: How to iterate over rows in a DataFrame in Pandas).
I suggest doing something like this:
# these columns can be copied over as a whole
output_df[['work_order_num', 'work_order_desc']] = input_df[['order_num', 'description']].to_numpy()

def classification(df_desc):
    temp = classifier(df_desc, labels)
    return temp['labels'][0], temp['scores'][0]

output_df['label'], output_df['score'] = zip(*input_df['description'].apply(classification))
The classification function returns tuples of values that need to be unpacked, so I used the zip trick from this question.
Also, building a dataframe by repeatedly appending rows is a very slow process. With the solution above you avoid two potentially prohibitively slow operations: the explicit for loop and appending rows to a dataframe one at a time.
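For comparison, even if you do keep an explicit loop, collecting the rows in a plain Python list and building the dataframe once at the end is usually far cheaper than appending to a dataframe inside the loop. A rough sketch under the same assumptions as the question (classifier is the zero-shot pipeline, and the column names come from input_df):

from tqdm import tqdm
import pandas as pd

labels = ['Fruit', 'Vegetable', 'Meat', 'Other']

rows = []
for _, row in tqdm(input_df.iterrows(), total=len(input_df)):
    temp = classifier(row['description'], labels)
    rows.append({
        'work_order_num': row['order_num'],
        'work_order_desc': row['description'],
        'label': temp['labels'][0],
        'score': temp['scores'][0],
    })

# build the dataframe once at the end instead of appending row by row
output_df = pd.DataFrame(rows)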
I wrote a function in which each cell of a DataFrame is divided by a number saved in another dataframe.
def calculate_dfA(df_t, xout):
    df_A = df_t.copy()
    vector_x = xout.T
    for index_col, column in tqdm(df_A.iteritems()):
        for index_row, row in df_A.iterrows():
            df_A.iloc[index_row, index_col] = df_A.iloc[index_row, index_col] / vector_x.iloc[0, index_col]
    return df_A
The DataFrame on which I apply the calculation has a size of 14839 rows x 14839 columns. According to tqdm, the processing speed is roughly 4.5 s/it. Accordingly, the calculation will require approximately 50 days, which is not feasible for me. Is there a way to speed up my calculation?
You need to vectorize your division:
result = df_A.values/vector_x
This will broadcast along the row dimension and divide along the column dimension, as you seem to ask for.
Compared to your double for-loop, you are taking advantage of contiguity and homogeneity of the data in memory. This allows for a massive speedup.
Edit: Coming back to this answer today, I am spotting that converting to a numpy array first speeds up the computation. Locally I get a 10x speedup for an array of size similar to the one in the question here-above. Have edited my answer.
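Putting this together, a minimal sketch of what the vectorized replacement for calculate_dfA might look like, assuming xout holds one divisor per column of df_t as in the question:

import numpy as np
import pandas as pd

def calculate_dfA(df_t, xout):
    # one divisor per column, reshaped to (1, n_cols) so the division
    # broadcasts down the rows instead of looping cell by cell
    divisors = np.asarray(xout).reshape(1, -1)
    values = df_t.to_numpy() / divisors
    return pd.DataFrame(values, index=df_t.index, columns=df_t.columns)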
I'm on mobile now, but you should try to avoid every for loop in Python - there's always a better way.
For one, I know you can multiply a pandas column (Series) by another column to get your desired result.
I think that to multiply every column by the matching column of another DataFrame you would still need to iterate (but only with one for loop, which is already a performance boost).
I would strongly recommend that you temporarily convert to a numpy ndarray and work with that instead.
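For the pandas-level route, DataFrame.div can divide every column by the matching entry of a Series in one call, with no explicit loop at all. A small sketch, assuming xout can be squeezed into a Series whose index matches the columns of df_t:

# squeeze xout into a Series indexed like the columns of df_t,
# then divide every column by its matching entry in one call
divisor = xout.squeeze()
df_A = df_t.div(divisor, axis='columns')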
I have a dataframe with 2 columns of zipcodes, and I would like to add another column with their distance values. I am able to do this with a fairly low number of rows, but I am now working with a dataframe that has about 500,000 rows. The code I have works, but on my current dataframe it has been running for about 30 minutes with no completion, so I feel what I'm doing is extremely inefficient.
Here is the code
import pgeocode

dist = pgeocode.GeoDistance('us')

def distance_pairing(start, end):
    return dist.query_postal_code(start, end)

zips['distance'] = zips.apply(lambda x: distance_pairing(x['zipstart'], x['zipend']), axis=1)
zips
I know looping is out of the question, so is there something else I can do, efficiency-wise, that would make this better?
Whenever possible, use vectorized operations in pandas and numpy. In this case:
zips['distance'] = dist.query_postal_code(
    zips['zipstart'].values,
    zips['zipend'].values,
)
This won't always work, but in this case, the underlying pgeocode.haversine function is written (in numpy) to accommodate arrays of x and y coordinates. This should speed up your code by several orders of magnitude for a dataframe of this size.
I have a number of parquet files, where all of the chunks together are too big to fit into memory. I would like to load them into a dask dataframe, compute some results (cumsum) and then display the cumsum as a plot. For this reason I wanted to select an equally spaced subset of data (some k rows) from the cumsum result, and then plot this subset. How would I do that?
You could try:
slices = 10  # or whatever
slice_point = int(df.shape[0] / slices)

for i in range(slices):
    current_sliced_df = df.loc[i*slice_point:(i+1)*slice_point]
    # ... and do whatever you want with the current slice
I think that using df[serie].sample(...) (see the docs) would let you avoid writing your own code for selecting a representative subset of rows.
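As a rough sketch of how the equally spaced selection might look with dask (the file pattern and the column name 'value' are placeholders, and this assumes the single cumsum column fits in memory once computed):

import dask.dataframe as dd

ddf = dd.read_parquet('data/*.parquet')   # placeholder path
cumsum = ddf['value'].cumsum()            # 'value' is a placeholder column name

k = 1000                                  # number of points to plot
n_rows = len(ddf)                         # triggers a pass over the data to count rows
step = max(n_rows // k, 1)

# compute the single cumsum column, then keep every step-th row for plotting
subset = cumsum.compute()[::step]
subset.plot()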
Ok so I have a dataframe object that's indexed as follows:
index, rev, metric1 (more metrics.....)
exp1, 92365, 0.018987
exp2, 92365, -0.070901
exp3, 92365, 0.150140
exp1, 87654, 0.003008
exp2, 87654, -0.065196
exp3, 87654, -0.174096
For each of these metrics I want to create individual stacked barplots comparing them based on their rev.
here's what I've tried:
df = df[['rev', 'metric1']]
df = df.groupby("rev")
df.plot(kind = 'bar')
This results in 2 individual bar graphs of the metric. Ideally I would have these two merged and stacked (right now stacked=True does nothing). Any help would be much appreciated.
This would give me my ideal result, however I don't think reorganizing to fit this is the best way to achieve my goal as I have many metrics and many revisions.
index, metric1(rev87654), metric1(rev92365)
exp1, 0.018987, 0.003008
exp2, -0.070901, -0.065196
exp3, 0.150140, -0.174096
This is my goal. (made by hand)
http://i.stack.imgur.com/5GRqB.png
following from this matplotlib gallery example:
http://matplotlib.org/examples/api/barchart_demo.html
There they get multiple bars to plot by calling bar once for each set.
You could access these values in pandas with indexing operations as follows:
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(16.2, 10), dpi=300)

# bars for the first unique value of SL
Y = Tire2[Tire2.SL == Tire2.SL.unique()[0]].SA.values[0:13]
X = np.linspace(0, np.size(Y), np.size(Y))
ax.bar(X, Y, width=.4)

# bars for the third unique value of SL, shifted so they sit alongside
Y = Tire2[Tire2.SL == Tire2.SL.unique()[2]].SA.values[0:13]
X = np.linspace(0, np.size(Y), np.size(Y)) + .5
ax.bar(X, Y, width=.4, color='r')
working from the inside out:
get all of the unique values of 'SL' in one of the cols (rev in your case)
Get a Boolean vector of all rows where 'SL' equals the first (or nth) unique value
Index Tire2 by that Boolean vector (this will pull out only those rows where the vector is True)
access the values of SA, or a metric in your case (I took only the [0:13] values because I was testing this on a huge data set)
bar plot those values
If your experiments are consistently in order in the frame (as shown), that's that. Otherwise you might need to run a little sorting to get your Y values in the right order; .sort_values(column_name) should take care of that. In my code, I'd slip it in between ...[0]] and .SA...
In general, this kind of operation can really help you out in wrangling big frames. .between is useful. And you can always add, multiply etc. the Boolean vectors to construct more complex logic.
I'm not sure how to get the plot you want automatically without doing exactly the reorganization you specify at the end. The answer by user3823992 gives you more detailed control of the plots, but if you want them more automatic here is some temporary reorganization that should work using the indexing similarly but also concatenating back into a DataFrame that will do the plot for you.
import numpy as np
import pandas as pd
exp = ['exp1','exp2','exp3']*2
rev = [1,1,1,2,2,2]
met1 = np.linspace(-0.5,1,6)
met2 = np.linspace(1.0,5.0,6)
met3 = np.linspace(-1,1,6)
df = pd.DataFrame({'rev':rev, 'met1':met1, 'met2':met2, 'met3':met3}, index=exp)
for met in df.columns:
    if met != 'rev':
        merged = df[df['rev'] == df.rev.unique()[0]][met]
        merged.name = merged.name + 'rev' + str(df.rev.unique()[0])
        for rev in df.rev.unique()[1:]:
            tmp = df[df['rev'] == rev][met]
            tmp.name = tmp.name + 'rev' + str(rev)
            merged = pd.concat([merged, tmp], axis=1)
        merged.plot(kind='bar')
This should give you three plots, one for each of my fake metrics.
EDIT: Or something like this might also do
df['exp'] = df.index
pt = pd.pivot_table(df, values='met1', index=['exp'], columns=['rev'])
pt.plot(kind='bar')