I have what I thought would be a straightforward thing to do in python using dask. I have a dataframe with some records in it, and I want to add a new column based on calling a function with values from two other columns as parameters.
Here is what I mean (pretend ge exists and takes two parameters):
import numpy as np
import dask.array as da

def gc(x, y):
    return ge(x, y)

def gdf(df):
    func1 = np.vectorize(gc)
    gh = da.from_array(func1(df.x, df.y))
    df['gh'] = gh
However, I seem to get one issue or another no matter what I try to do. Currently, in the above state, I get
Number of partitions do not match (2 != 33)
It feels like I'm either going about this all wrong (maybe I need map_blocks, map_partitions, or even a gufunc), or I'm missing something easy, like a way to set the number of partitions on my array to match that of my dataframe.
Any help would be appreciated.
It should be possible to do this with assign or map_partitions:
func1 = np.vectorize(gc)
df = df.assign(gh=lambda df: func1(df.x, df.y))

# or try this
def myfunc(df):
    df['gh'] = func1(df.x, df.y)
    return df

df = df.map_partitions(myfunc)
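If dask has trouble inferring the output schema of myfunc on its own, you can pass meta to map_partitions. Below is a minimal, self-contained sketch of that pattern; the toy data, the column dtypes, and the stand-in body of gc are assumptions on my part, not the original ge:

import numpy as np
import pandas as pd
import dask.dataframe as dd

def gc(x, y):
    return x + y  # stand-in for the real ge(x, y)

func1 = np.vectorize(gc)

def myfunc(part):
    part = part.copy()
    part['gh'] = func1(part.x, part.y)
    return part

pdf = pd.DataFrame({'x': range(100), 'y': range(100)})
ddf = dd.from_pandas(pdf, npartitions=4)

# meta describes the columns/dtypes of the result, so dask skips schema inference
ddf = ddf.map_partitions(myfunc, meta={'x': 'i8', 'y': 'i8', 'gh': 'i8'})
print(ddf.head())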
Not sure if this is a good idea after all, but I have a dictionary with arrays as values, such as
DF = {'z_eu': array([127.45064758, 150.4478288, 150.74781189, -98.3227338, -98.25155681, -98.24993753]),
      'Process': array(['initStep', 'Transportation', 'Transportation', 'Transportation', 'Transportation', 'phot']),
      'Creator': array(['SynRad', 'SynRad', 'SynRad', 'SynRad', 'SynRad', 'SynRad'])}
I need to do a selection of the numeric data (z_eu) based on values of the other two keys.
One workaround I came up with so far was to extract the arrays and iterate through them, building another array which contains only the valid data.
proc = DF['Process']
z = DF['z_eu']
creat = DF['Creator']
data = [z for z, p, c in zip(z, proc, creat) if p == 'initStep' and c == 'SynRad']
But this seems like effort that could be avoided entirely by dealing with the dictionary more intelligently in the first place. Also, the zip() takes a long time.
I know that dataframes are a valid alternative but unfortunately, since I'm dealing with strings, pandas appears to be too slow.
Any hints are most welcome!
A bit simpler, using conditional (boolean) slicing, you could write
data = DF['z_eu'][(DF['Process'] == 'initStep') & (DF['Creator'] == 'SynRad')]
...or still using zip, you could simplify to
data = [z for z, p, c in zip(*DF.values()) if p == 'initStep' and c == 'SynRad']
Basically also conditional slicing, using a pandas DataFrame:
df = pd.DataFrame(DF)
data = df.loc[(df['Process'] == 'initStep') & (df['Creator'] == 'SynRad'), 'z_eu']
print(data)
# 0 127.450648
# Name: z_eu, dtype: float64
In principle I'd say there's nothing wrong with handling numpy arrays in a dict. You get a lot of flexibility, and operations are sometimes more efficient if you do them directly in numpy (you could even use numba for purely numerical, expensive calculations). But if that is not needed and you're fine with what is basically an n*m table, pandas DataFrames are nice and convenient.
If your dataset is large and you want to perform many look-ups like the one shown, you might not want to perform those on strings. To improve performance, you could e.g. come up with unique integer IDs for each 'Process' or 'Creator' value from the example. You just need to be able to map those back to the original strings, so keep that data as well.
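As a rough sketch of that idea (the variable names here are just for illustration), np.unique with return_inverse gives you both the lookup table of strings and the integer codes in one pass:

import numpy as np

proc_labels, proc_codes = np.unique(DF['Process'], return_inverse=True)
crea_labels, crea_codes = np.unique(DF['Creator'], return_inverse=True)

# integer code of the string value you want to filter on
init_id = np.where(proc_labels == 'initStep')[0][0]
syn_id = np.where(crea_labels == 'SynRad')[0][0]

# same selection as before, but comparing ints instead of strings
data = DF['z_eu'][(proc_codes == init_id) & (crea_codes == syn_id)]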
You can loop over the indices of one array and pick the matching elements by position:
z_eu = DF['z_eu']
process = DF['Process']
creator = DF['Creator']

result = []
for i in range(len(z_eu)):
    if process[i] == 'initStep' and creator[i] == 'SynRad':
        result.append(z_eu[i])

print(result)
I have an SFrame with columns whose elements are dicts.
import graphlab
import numpy as np
a = graphlab.SFrame({'col1':[{'oshan':3,'modi':4},{'ravi':1,'kishan':5}],
'col2':[{'oshan':1,'rawat':2},{'hari':3,'kishan':4}]})
I want to calculate the cosine distance between these two columns for each row of the SFrame. Below is the operation using a for loop.
dis = np.zeros(len(a), dtype=float)
for i in range(len(a)):
    dis[i] = graphlab.distances.cosine(a['col1'][i], a['col2'][i])
a['distance12'] = dis
This is very inefficient and would take hours if the number of rows were large. Could someone please suggest a better approach?
You can usually avoid looping over an SFrame by using the apply function. In your case, it would look like this:
a.apply(lambda row: graphlab.distances.cosine(row['col1'], row['col2']))
That should be significantly faster than looping in Python.
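To keep the result, assign the output of apply back to a new column, e.g.:

a['distance12'] = a.apply(lambda row: graphlab.distances.cosine(row['col1'], row['col2']))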
I wrote code that concatenates parts of a DataFrame back onto the same DataFrame, in order to normalize the occurrence of rows per value of a certain column.
import random
import pandas

def normalize(data, expectation):
    """Normalize data by duplicating existing rows"""
    counts = data[expectation].value_counts()
    max_count = int(counts.max())
    for tag, group in data.groupby(expectation, sort=False):
        array = pandas.DataFrame(columns=data.columns.values)
        i = 0
        while i < (max_count // int(counts[tag])):
            array = pandas.concat([array, group])
            i += 1
        i = max_count % counts[tag]
        if i > 0:
            # .ix has been removed from pandas; .loc with a list of labels does the same here
            array = pandas.concat([array, group.loc[random.sample(list(group.index), i)]])
        data = pandas.concat([data, array])
    return data
and this is unbelievably slow. Is there a way to concatenate DataFrames quickly, without creating copies of them over and over?
There are a couple of things that stand out.
To begin with, the loop
i = 0
while i < (max_count // int(counts[tag])):
    array = pandas.concat([array, group])
    i += 1
is going to be very slow. Pandas is not built for these dynamic concatenations, and I suspect the performance is quadratic for what you're doing.
Instead, perhaps you could try
pandas.concat([group] * (max_count // int(counts[tag])))
which creates a list first and then calls concat once on the entire list. This should bring the complexity down to linear, and I suspect it will have lower constants in any case.
Another thing that would reduce these small concats is a groupby-apply. Instead of iterating over the result of groupby, write the loop body as a function and call apply on it; let Pandas figure out how best to concat all of the results into a single DataFrame.
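A rough sketch of that pattern (expand is a hypothetical helper name; this returns max_count rows per group rather than appending the duplicates to the original data, so it illustrates the structure rather than being a drop-in replacement):

import pandas

def expand(group, max_count):
    reps = max_count // len(group)
    remainder = max_count % len(group)
    parts = [group] * reps
    if remainder:
        parts.append(group.sample(remainder))
    # one concat per group instead of many incremental ones
    return pandas.concat(parts)

def normalize(data, expectation):
    max_count = int(data[expectation].value_counts().max())
    return data.groupby(expectation, sort=False, group_keys=False).apply(
        lambda g: expand(g, max_count))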
However, even if you prefer to keep the loop, I'd just append things to a list and concat everything once at the end:
stuff = []
for tag, group in data.groupby(expectation, sort=False):
    # Call stuff.append(...) for any DataFrame you were going to concat.
    ...
pandas.concat(stuff)
I have a huge CSV with many tables with many rows. I would like to simply split each dataframe into 2 if it contains more than 10 rows.
If true, I would like the first dataframe to contain the first 10 and the rest in the second dataframe.
Is there a convenient function for this? I've looked around but found nothing useful...
i.e. split_dataframe(df, 2(if > 10))?
I used a list comprehension to cut a huge DataFrame into blocks of 100'000 rows:
size = 100000
list_of_dfs = [df.loc[i:i+size-1, :] for i in range(0, len(df), size)]
or as a generator:
list_of_dfs = (df.loc[i:i+size-1, :] for i in range(0, len(df), size))
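Note that df.loc here relies on the default integer RangeIndex. A position-based variant with iloc (a small change on my part) avoids that assumption:

list_of_dfs = [df.iloc[i:i+size] for i in range(0, len(df), size)]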
This will return the two split DataFrames if the condition is met; otherwise it returns the original and None (which you would then need to handle separately). Note that this assumes the splitting only has to happen once per df, and that the second part of the split being longer than 10 rows (meaning the original was longer than 20 rows) is OK.
df_new1, df_new2 = (df.iloc[:10, :], df.iloc[10:, :]) if len(df) > 10 else (df, None)
Note you can also use df.head(10) and df.tail(len(df) - 10) to get the front and back according to your needs. You can also use various other indexing approaches: you can just provide the row index if you want, such as df.iloc[:10] instead of df.iloc[:10, :] (though I like to be explicit about the dimensions being taken). (df.ix used to index in similar ways, but it has since been removed from pandas.)
Be careful with df.loc, however, since it is label-based and the input will never be interpreted as an integer position. .loc would only work "accidentally" in the case when you happen to have index labels that are integers starting at 0 with no gaps.
But you should also consider the various options pandas provides for dumping the contents of the DataFrame into HTML, and possibly also LaTeX, to produce better-designed tables for presentation (instead of just copying and pasting). Simply googling how to convert a DataFrame to these formats turns up lots of tutorials and advice for exactly this application.
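For instance, DataFrame.to_html and DataFrame.to_latex cover exactly that (a minimal example, assuming df already exists):

html_table = df.head(10).to_html()
latex_table = df.head(10).to_latex()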
There is no specific convenience function.
You'd have to do something like:
first_ten = pd.DataFrame()
rest = pd.DataFrame()

if df.shape[0] > 10:  # len(df) > 10 would also work
    first_ten = df[:10]
    rest = df[10:]
A method based on np.split:
df = pd.DataFrame({ 'A':[2,4,6,8,10,2,4,6,8,10],
'B':[10,-10,0,20,-10,10,-10,0,20,-10],
'C':[4,12,8,0,0,4,12,8,0,0],
'D':[9,10,0,1,3,np.nan,np.nan,np.nan,np.nan,np.nan]})
listOfDfs = [df.loc[idx] for idx in np.split(df.index,5)]
A small function that uses a modulo could take care of cases where the split is not even (e.g. np.split(df.index,4) will throw an error).
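Alternatively, np.array_split tolerates uneven splits out of the box, so no modulo handling is needed (a small substitution on my part):

listOfDfs = [df.loc[idx] for idx in np.array_split(df.index, 4)]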
(Yes, I am aware that the original question was somewhat more specific than this. However, this is supposed to answer the question in the title.)
Below is a simple function that splits a DataFrame into chunks, followed by a few usage examples:
import pandas as pd

def split_dataframe_to_chunks(df, n):
    df_len = len(df)
    count = 0
    dfs = []
    while True:
        if count > df_len - 1:
            break
        start = count
        count += n
        # print("%s : %s" % (start, count))
        dfs.append(df.iloc[start:count])
    return dfs
# Create a DataFrame with 10 rows
df = pd.DataFrame([i for i in range(10)])
# Split the DataFrame to chunks of maximum size 2
split_df_to_chunks_of_2 = split_dataframe_to_chunks(df, 2)
print([len(i) for i in split_df_to_chunks_of_2])
# prints: [2, 2, 2, 2, 2]
# Split the DataFrame to chunks of maximum size 3
split_df_to_chunks_of_3 = split_dataframe_to_chunks(df, 3)
print([len(i) for i in split_df_to_chunks_of_3])
# prints [3, 3, 3, 1]
If you have a large DataFrame and need to divide it into a variable number of sub-DataFrames, for example with a maximum of 4500 rows each, this script could help:
max_rows = 4500
dataframes = []

while len(df) > max_rows:
    top = df[:max_rows]
    dataframes.append(top)
    df = df[max_rows:]
else:
    dataframes.append(df)
You could then save out these data frames:
for i, frame in enumerate(dataframes):
    frame.to_csv(str(i) + '.csv', index=False)
Hope this helps someone!
import os

def split_and_save_df(df, name, size, output_dir):
    """
    Split a df and save each chunk in a different csv file.

    Parameters:
        df : pandas df to be split
        name : name to give to the output file
        size : chunk size
        output_dir : directory where to write the divided df
    """
    for i in range(0, df.shape[0], size):
        start = i
        # .loc slicing is inclusive, so the last valid label is df.shape[0] - 1
        end = min(i + size - 1, df.shape[0] - 1)
        subset = df.loc[start:end]
        output_path = os.path.join(output_dir, f"{name}_{start}_{end}.csv")
        print(f"Going to write into {output_path}")
        subset.to_csv(output_path)
        output_size = os.stat(output_path).st_size
        print(f"Wrote {output_size} bytes")
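Hypothetical usage, writing the chunks to the current directory:

import pandas as pd

df = pd.DataFrame({'a': range(10)})
split_and_save_df(df, name="chunk", size=4, output_dir=".")
# writes chunk_0_3.csv, chunk_4_7.csv and chunk_8_9.csv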
You can use the DataFrame head and tail methods as syntactic sugar instead of slicing/loc here. I use a split size of 3; for your example use headSize=10
def split(df, headSize):
    hd = df.head(headSize)
    tl = df.tail(len(df) - headSize)
    return hd, tl

df = pd.DataFrame({ 'A':[2,4,6,8,10,2,4,6,8,10],
                    'B':[10,-10,0,20,-10,10,-10,0,20,-10],
                    'C':[4,12,8,0,0,4,12,8,0,0],
                    'D':[9,10,0,1,3,np.nan,np.nan,np.nan,np.nan,np.nan]})

# Split dataframe into top 3 rows (first) and the rest (second)
first, second = split(df, 3)
A method based on a list comprehension and groupby, which stores all the split dataframes in a list variable; they can then be accessed by index.
Example:
ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]
ans[0]
ans[0].column_name