myfunc does some processing on a dataframe. I am trying to reduce computation time by vectorizing myfunc. Each dataframe is created by reading a very large text file (30 GB). I tried to create an array of dataframes and then vectorize myfunc so that it applies to the array of dataframes, but the problem is that np.vectorize applies to each cell of a dataframe, not to the whole dataframe. Even when I extract some columns of a dataframe as an array, np.vectorize applies myfunc to each cell inside the array rather than to the whole array. I am not sure this is the right way to solve the problem. Please share your thoughts. Thank you.
import numpy as np
import pandas as pd

def myfunc(a):
    # Do some process on dataframe
    return a * 2

vecfunc = np.vectorize(myfunc)

x = pd.DataFrame(np.array([[1, 2, 3], [1, 2, 3]]))
y = pd.DataFrame(np.array([[1, 2, 3], [1, 2, 3]]))
z = pd.DataFrame(np.array([[1, 2, 3], [1, 2, 3]]))

result = vecfunc([x, y, z])
print(result)
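If the goal is simply to run myfunc once per whole DataFrame, a hedged sketch of an alternative (assuming nothing about myfunc beyond what is shown): a plain list comprehension or map passes each DataFrame to myfunc intact, whereas np.vectorize is built for elementwise broadcasting and will always descend to individual cells.

import numpy as np
import pandas as pd

def myfunc(a):
    # pandas arithmetic is already vectorized across the whole DataFrame
    return a * 2

frames = [pd.DataFrame(np.array([[1, 2, 3], [1, 2, 3]])) for _ in range(3)]

# Apply myfunc to each whole DataFrame, not to each cell
results = [myfunc(df) for df in frames]
print(results[0])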
I have a pandas DataFrame (dataset, 889x4) and a NumPy ndarray (targets_one_hot, 889x29), which I want to concatenate. To do that, I want to convert targets_one_hot into a pandas DataFrame.
I looked at several suggestions for this. However, those suggestions are about smaller arrays, for which it is okay to write out the different columns by hand.
For 29 columns, this seems inefficient. Can anyone suggest an efficient way to turn this NumPy array into a pandas DataFrame?
We can wrap a NumPy array in a pandas DataFrame by passing it as the first parameter. Then we can use pd.concat(..) [pandas-doc] to concatenate the original dataset and the DataFrame built from targets_one_hot into a new DataFrame. Since we concatenate horizontally here (adding the 29 columns next to the existing 4), we need to set the axis parameter to axis=1:
pd.concat((dataset, pd.DataFrame(targets_one_hot)), axis=1)
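A small self-contained sketch of this answer, with made-up stand-ins for dataset and targets_one_hot (the real ones are 889x4 and 889x29); pd.concat aligns on the index, so building the new DataFrame with the dataset's index (or leaving both on the default RangeIndex) keeps the rows lined up:

import numpy as np
import pandas as pd

# Stand-ins for the real data: a 5x2 DataFrame and a 5x3 one-hot array
dataset = pd.DataFrame({"a": range(5), "b": range(5, 10)})
targets_one_hot = np.eye(5)[:, :3]

# Wrap the array in a DataFrame that reuses dataset's index, then concatenate column-wise
combined = pd.concat(
    (dataset, pd.DataFrame(targets_one_hot, index=dataset.index)),
    axis=1,
)
print(combined.shape)  # (5, 5)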
I would like to ask for your help. The problem, in steps:
1. Import two Excel files into pandas DataFrames - so far no problem.
2. Convert the DataFrames into NumPy arrays.
3. Create a VLOOKUP-style function in Python with the arrays. Both arrays have a key in the first column, which is unique and can be used for matching. The two tables contain data that is correct in one table but not in the other. I would like to overwrite the values in the table where they are wrong with the values from the table where they are correct (I know which table has the right values...).
Is there a more NumPy-like way to do it?
The code I have written so far:
import pandas as pd

s = pd.read_excel(r"C:\a.xlsx")
r = pd.read_excel(r"C:\b.xlsx")

z = s.values
t = r.values

# Match the two arrays on the key in the first column and overwrite the wrong value
for i in range(len(z)):
    for j in range(len(t)):
        if z[i, 0] == t[j, 0]:
            t[j, 41] = z[i, 5]
If they have the same length, use pd.merge; it acts like VLOOKUP:
newdf = s.merge(r, on='same_key')
newdf will have all the columns from both data frames. You can now access the individual columns you need to update:
newdf['wrongcolumn'] = newdf['rightcolumn']
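A short, self-contained illustration of this answer with made-up frames and placeholder column names (key and value are not columns from the actual Excel files); note that when both frames contain identically named columns, merge distinguishes them with suffixes:

import pandas as pd

# Hypothetical stand-ins for the two Excel files
s = pd.DataFrame({"key": [1, 2, 3], "value": [10, 20, 30]})  # table with correct values
r = pd.DataFrame({"key": [3, 1, 2], "value": [99, 99, 99]})  # table with wrong values

# Merge on the shared key; suffixes distinguish the duplicated column names
newdf = r.merge(s, on="key", suffixes=("_wrong", "_right"))

# Overwrite the wrong column with the right one
newdf["value_wrong"] = newdf["value_right"]
print(newdf)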
So basically I'm trying to transform a list into a DataFrame.
Here are the two ways I am trying, but I cannot come up with a good performance benchmark.
import pandas as pd
mylist = [1,2,3,4,5,6]
names = ["name","name","name","name","name","name"]
# Way 1
pd.DataFrame([mylist], columns=names)
# Way 2
pd.DataFrame.from_records([mylist], columns=names)
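Since the question is about benchmarking the two constructors, here is a minimal sketch using the standard timeit module (results will vary by machine and pandas version, so treat the numbers as indicative only):

import timeit

import pandas as pd

mylist = [1, 2, 3, 4, 5, 6]
names = ["name"] * 6

# Time each construction a large number of times and compare
t1 = timeit.timeit(lambda: pd.DataFrame([mylist], columns=names), number=10_000)
t2 = timeit.timeit(lambda: pd.DataFrame.from_records([mylist], columns=names), number=10_000)

print(f"pd.DataFrame:              {t1:.3f} s")
print(f"pd.DataFrame.from_records: {t2:.3f} s")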
I also tried dask but I did not find anything that could work for me.
So I just made up an example with 10 columns, each holding around 1 million random values, and I got the maximum result very quickly. Does this give you a start for working with dask? They proposed the approach here, which is also related to this question.
import dask.dataframe as dd
from dask.delayed import delayed
import pandas as pd
import numpy as np

# Create a list of arrays, each holding 1 million random floats
list_large = [np.random.random_sample(int(1e6)) * i for i in range(10)]

# Convert it to a dask dataframe by wrapping each array in a delayed pandas DataFrame
dfs = [delayed(pd.DataFrame)(i) for i in list_large]
df = dd.from_delayed(dfs)

# Calculate the maximum (avoid shadowing the built-in max)
max_value = df.max().compute()
I have to write an object that takes either a pandas DataFrame or a NumPy array as input (similar to sklearn behavior). In one of the methods of this object, I need to select columns (not a particular fixed one; I get a few column indices based on other calculations).
So, to make my code compatible with both input types, I tried to find a common way to select columns and tried methods like X[:, 0] (which doesn't work on pandas DataFrames), X[0], and others, but they select differently. Is there a way to select columns in a similar fashion across pandas and NumPy?
If no then how does sklearn work across these data structures?
You can use an if condition within your method and have separate selection logic for pandas DataFrames and NumPy arrays. Sample code is given below.
import pandas as pd

def method_1(self, var, col_indices):
    if isinstance(var, pd.DataFrame):
        selected_columns = var[var.columns[col_indices]]
    else:
        selected_columns = var[:, col_indices]
    return selected_columns
Here, var is your input, which can be a NumPy array or a pandas DataFrame, and col_indices are the indices of the columns you want to select.
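A quick usage sketch of the same idea, written as a plain function for brevity (iloc is used here as an equivalent positional selector to var[var.columns[col_indices]]):

import numpy as np
import pandas as pd

def select_columns(var, col_indices):
    # Dispatch on input type: positional column selection for both cases
    if isinstance(var, pd.DataFrame):
        return var.iloc[:, col_indices]
    return var[:, col_indices]

arr = np.arange(12).reshape(3, 4)
df = pd.DataFrame(arr, columns=list("abcd"))

print(select_columns(arr, [0, 2]))  # ndarray with columns 0 and 2
print(select_columns(df, [0, 2]))   # DataFrame with columns 'a' and 'c'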
I want to apply a function that calculates the logarithm of a single column of a large dataset using Dask. How can I do that? This is what I tried:
df_train.apply(lambda x: np.log1p(x), axis=1, meta={'column_name': 'float32'}).compute()
The dataset is very large (125 million rows). How can I do that efficiently?
You have a few options:
Use dask.array functions
Just as your pandas dataframe can use numpy functions:
import numpy as np
result = np.log1p(df.x)
Dask dataframes can use dask array functions
import dask.array as da
result = da.log1p(df.x)
Map Partitions
But maybe no such dask.array function exists for your particular function. You can always use map_partitions to apply any function that you would normally run on a pandas dataframe across all of the pandas dataframes that make up your dask dataframe:
Pandas
result = f(df.x)
Dask DataFrame
result = df.x.map_partitions(f)
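A concrete sketch of the map_partitions option applied to the log1p case from the question; the column name and the small example frame are made up, the pattern is what matters:

import dask.dataframe as dd
import numpy as np
import pandas as pd

# Build a small dask dataframe from pandas just for illustration
pdf = pd.DataFrame({"x": np.arange(1, 101, dtype="float64")})
df = dd.from_pandas(pdf, npartitions=4)

# np.log1p runs once per partition (a plain pandas Series), not once per row
result = df.x.map_partitions(np.log1p, meta=("x", "float64")).compute()
print(result.head())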
Map
You can always use the map or apply(axis=0) methods, but just like in Pandas these are usually very bad for performance.