I have a Pandas DataFrame with many columns and I need to take the part number column and use the data in it to populate the features column. My function add_data takes the part number and looks it up in a SQL database and returns the feature notes.
I have this working with df.apply and it works well
(code block 1).
Because I recently started learning vectorization and was wondering if there was a way to do this without df.apply(code block 2)
#Working Code
df['Features'] = df.apply (lambda row: add_data(row['partnumber']), axis=1)
def add_data(row):
featurenotes = sql_lookup(row)
return featurenotes
This line of code calls my add_data function but I don't know how to grab just the value of the row in the part number column to use in my function.
df['Features'] = add_data(df['partnumber'])
I'm am very new to python so I am trying to learn best practices and how to manipulate pandas DataFrames.
Related
I have a problem with an excel file! and I want to automate it by using python script to complete a column based on the information of the first column: for example:
if data == 'G711Alaw 64k' or 'G711Ulaw 64k'
print('1-Jan) till find it == '2-Jan' then print('2-Jan') and so on.
befor automate
I need its looks like this after automate:
after automate
Is there anyone can help me to do solve this issue?
The file:
the excel file
Thanks a lot for your help.
Try this, pandas reads your jan-1 is datetime type, if you need to change it to a string you can set it directly in the code, the following code will directly assign the value read to the second column:
import pandas as pd
df = pd.read_excel("add_date_column.xlsx", engine="openpyxl")
sig = []
def t(x):
global sig
if not isinstance(x.values[0], str):
tmp_sig = x.values[0]
if tmp_sig not in sig:
sig = [tmp_sig]
x.values[1] = sig[-1]
return x
new_df = df.apply(t, axis=1)
new_df.to_excel("new.xlsx", index=False)
The concept is very simple :
If the value is date/time, copy to the [same row, next column].
If not, [same row, next column] is copied from [previous row, next
column].
You do not specifically need Python for this task. The excel formula for this would be;
=IF(ISNUMBER(A:A),A:A,B1)
Instead of checking if it is date/time, I took adavantage of the fact that the rest of the entries are alphanumeric (including both alphabets and numbers). This formula is applied on the new column.
Of course, you might already be in Python and just work within the same environment. So, here's the loop :
for i in range(len(df)):
if type(df["Orig. Codec"][i]) is datetime:
df["Column1"][i] = df["Orig. Codec"][i]
else:
df["Column1"][i] = df["Column1"][i-1]
There might be ways to lambda function for the same concept, not that I am aware of how to apply lambda and shift at the same time.
I have a code that works with an excel file (SAP Download) quite extensively (data transformation and calculation steps).
I need to loop through all the lines (couple thousand rows) a few times. I have written a code prior that adds DataFrame columns separately, so I could do everything in one for loop that was of course quite quick, however, I had to change data source that meant change in raw data structure.
The raw data structure has 1st 3 rows empty, then a Title row comes with column names, then 2 rows empty, and the 1st column is also empty. I decided to wipe these, and assign column names and make them headers (steps below), however, since then, separately adding column names and later calculating everything in one for statement does not fill data to any of these specific columns.
How could i optimize this code?
I have deleted some calculation steps since they are quite long and make code part even less readable
#This function adds new column to the dataframe
def NewColdfConverter(*args):
for i in args:
dfConverter[i] = '' #previously used dfConverter[i] = NaN
#This function creates dataframe from excel file
def DataFrameCreator(path,sheetname):
excelFile = pd.ExcelFile(path)
global readExcel
readExcel = pd.read_excel(excelFile,sheet_name=sheetname)
#calling my function to create dataframe
DataFrameCreator(filePath,sheetName)
dfConverter = pd.DataFrame(readExcel)
#dropping NA values from Orders column (right now called Unnamed)
dfConverter.dropna(subset=['Unnamed: 1'], inplace=True)
#dropping rows and deleting other unnecessary columns
dfConverter.drop(dfConverter.head(1).index, inplace=True)
dfConverter.drop(dfConverter.columns[[0,11,12,13,17,22,23,48]], axis = 1,inplace = True)
#renaming columns from Unnamed 1: etc to proper names
dfConverter = dfConverter.rename(columns={Unnamed 1:propername1 Unnamed 2:propername2 etc.})
#calling new column function -> this Day column appears in the 1st for loop
NewColdfConverter("Day")
#example for loop that worked prior, but not working since new dataset and new header/column steps added:
for i in range(len(dfConverter)):
#Day column-> floor Entry Date -1, if time is less than 5:00:00
if(dfConverter['Time'][i] <= time(hour=5,minute=0,second=0)):
dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])-timedelta(days=1)
else:
dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])
Problem is, there are many columns that build on one another, so I cannot get them in one for loop, for instance in below example I need to calculate reqsWoSetUpValue, so I can calculate requirementsValue, so I can calculate otherReqsValue, but I'm not able to do this within 1 for loop by assigning the values to the dataframecolumn[i] row, because the value will just be missing, like nothing happened.
(dfsorted is the same as dfConverter, but a sorted version of it)
#example code of getting reqsWoSetUpValue
for i in range(len(dfSorted)):
reqsWoSetUpValue[i] = #calculationsteps...
#inserting column with value
dfSorted.insert(49,'Reqs wo SetUp',reqsWoSetUpValue)
#getting requirements value with previously calculated Reqs wo SetUp column
for i in range(len(dfSorted)):
requirementsValue[i] = #calc
dfSorted.insert(50,'Requirements',requirementsValue)
#Calculating Other Reqs value with previously calculated Requirements column.
for i in range(len(dfSorted)):
otherReqsValue[i] = #calc
dfSorted.insert(51,'Other Reqs',otherReqsValue)
Anyone have a clue, why I cannot do this in 1 for loop anymore by 1st adding all columns by the function, like:
NewColdfConverter('Reqs wo setup','Requirements','Other reqs')
#then in 1 for loop:
for i in range(len(dfsorted)):
dfSorted['Reqs wo setup'] = #calculationsteps
dfSorted['Requirements'] = #calculationsteps
dfSorted['Other reqs'] = #calculationsteps
Thank you
General comment: How to identify bottlenecks
To get started, you should try to identify which parts of the code are slow.
Method 1: time code sections using the time package
Wrap blocks of code in statements like this:
import time
t = time.time()
# do something
print("time elapsed: {:.1f} seconds".format(time.time() - t))
Method 2: use a profiler
E.g. Spyder has a built-in profiler. This allows you to check which operations are most time consuming.
Vectorize your operations
Your code will be orders of magnitude faster if you vectorize your operations. It looks like your loops are all avoidable.
For example, rather than calling pd.to_datetime on every row separately, you should call it on the entire column at once
# slow (don't do this):
for i in range(len(dfConverter)):
dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])
# fast (do this instead):
dfConverter['Day'] = pd.to_datetime(dfConverter['Entry Date'])
If you want to perform an operation on a subset of rows, you can also do this in a vectorized operation by using loc:
mask = dfConverter['Time'] <= time(hour=5,minute=0,second=0)
dfConverter.loc[mask,'Day'] = pd.to_datetime(dfConverter.loc[mask,'Entry Date']) - timedelta(days=1)
Not sure this would improve performance, but you could calculate the dependent columns at the same time row by row with DataFrame.iterrows()
for index, data in dfSorted.iterrows():
dfSorted['Reqs wo setup'][index] = #calculationsteps
dfSorted['Requirements'][index] = #calculationsteps
dfSorted['Other reqs'][index] = #calculationsteps
Aim: To speed up applying a function row wise across a large data frame (1.9 million ~ rows)
Attempt: Using dask map_partitions where partitions == number of cores. I've written a function which is applied to each row, creates a dict containing a variable number of new values (between 1 and 55). This function works fine standalone.
Problem: I need a way to combine the output of each function into a final dataframe. I tried using df.append, where I'd append each dict to a new dataframe and return this dataframe. If I understand the Dask Docs, Dask should then combine them to one big DF. Unfortunately this line is tripping an error (ValueError: could not broadcast input array from shape (56) into shape (1)). Which leads me to believe it's something to do with the combine feature in Dask?
#Function to applied row wise down the dataframe. Takes a column (post) and new empty df.
def func(post,New_DF):
post = str(post)
scores = OtherFUNC.countWords(post)
scores['post'] = post
New_DF = New_DF.append(scores, ignore_index=True)
return(New_DF)
#Dask
dd.from_pandas(dataset,npartitions=nCores).\
map_partitions(
lambda df : df.apply(
lambda x : func(x.post,New_DF),axis=1)).\
compute(get=get)
I am not quite sure I completely understand your code in lieu of an MCVE but I think there is a bit of a misunderstanding here.
In this piece of code you take a row and a DataFrame and append one row to that DataFrame.
#Function to applied row wise down the dataframe. Takes a column (post) and new empty df.
def func(post,New_DF):
post = str(post)
scores = OtherFUNC.countWords(post)
scores['post'] = post
New_DF = New_DF.append(scores, ignore_index=True)
return(New_DF)
Instead of appending to New_DF, I would recommend just returning a pd.Series which df.apply concatenates into a DataFrame. That is because if you are appending to the same New_DF object in all nCores partitions, you are bound to run into trouble.
#Function to applied row wise down the dataframe. Takes a row and returns a row.
def tobsecret_func(row):
post = str(row.post)
scores = OtherFUNC.countWords(post)
scores['post'] = post
length_adjusted_series = pd.Series(scores).reindex(range(55))
return(length_adjusted_series)
Your error also suggests that as you wrote in your question, your function creates a variable number of values. If the pd.Series you return doesn't have the same shape and column names, then df.apply will fail to concatenate them into a pd.DataFrame. Therefore make sure you return a pd.Series of equal shape each time. This question shows you how to create pd.Series of equal length and index: Pandas: pad series on top or bottom
I don't know what kind of dict your OtherFUNC.countWords returns exactly, so you may want to adjust the line:
length_adjusted_series = pd.Series(scores).reindex(range(55))
As is, the line would return a Series with an index 0, 1, 2, ..., 54 and up to 55 values (if the dict originally had less than 55 keys, the remaining cells will contain NaN values).
This means after applied to a DataFrame, the columns of that DataFrame would be named 0, 1, 2, ..., 54.
Now you take your dataset and map your function to each partition and in each partition you apply it to the DataFrame using apply.
#Dask
dd.from_pandas(dataset,npartitions=nCores).\
map_partitions(
lambda df : df.apply(
lambda x : func(x.post,New_DF),axis=1)).\
compute(get=get)
map_partitions expects a function which takes as input a DataFrame and outputs a DataFrame. Your function is doing this by using a lambda function that basically calls your other function and applies it to a DataFrame, which in turn returns a DataFrame. This works but I highly recommend writing a named function which takes as input a DataFrame and outputs a DataFrame, it makes it easier for you to debug your code.
For example with a simple wrapper function like this:
df_wise(df):
return df.apply(tobsecret_func)
Especially as your code gets more complex, abstaining from using lambda functions that call non-trivial code like your custom func and instead making a simple named function can help you debug because the traceback will not just lead you to a line with a bunch of lambda functions like in your code but will also directly point to the named function df_wise, so you will see exactly where the error is coming from.
#Dask
dd.from_pandas(dataset,npartitions=nCores).\
map_partitions(df_wise,
meta=df_wise(dd.head())
).\
compute(get=get)
Notice that we just fed dd.head() to df_wise to create our meta-keyword which is similar to what Dask would do under the hood.
You are using dask.get, the synchronous scheduler which is why the whole New_DF.append(...) code could work, since you append to the DataFrame for each consecutive partition.
This does not give you any parallelism and thus will not work if you use one of the other schedulers, all of which parallelise your code.
The documentation also mentions the meta keyword argument, which you should supply to your map_partitions call, so dask knows what columns your DataFrame will have. If you don't do this, dask will first have to do a trial run of your function on one of the partitions and check what the shape of the output is before it can go ahead and do the other partitions. This can slow down your code by a ton if your partitions are large; giving the meta keyword bypasses this unnecessary computation for dask.
I'm having trouble in correctly executing a for loop through my dataframe in python.
Basically, for every row in the dataframe (df_weather), the code should select one value each from the column no. 13 and 14 and execute a function which is defined earlier in the code. Eventually, I require the calculated value in each row to be summed to give me one final answer.
The error being returned is as follows: "string indices must be integers"
Request anyone to help me through this step. The code for the same is provided below.
Thanks!
stress_rate = 0
for i in df_weather:
b = GetStressDampHeatParameterized(i[:,13], i[:,14])
stress_rate = b + stress_rate
print(stress_rate)
This can be solved in a single line:
print sum(df.apply(lambda row: func(row[14], row[15]), axis=1))
Where func is your desired function and axis=1 ensures that the function is applied on each row as opposed to each column (which is the default).
My solution first creates a temporary series (picture: an unattached column) that is constructed by applying a function to each row in turn. The function that is actually being applied is an anonymous function indicated by the keyword lambda, which takes a single input row and which is fed a single row at a time from the apply method. That anonymous function simply calls your function func and passes the two column values in the row.
A Series can be summed using the sum function.
Note the indexing of the columns starts at 0.
Also note, saying for x in df: will iterate over the columns.
your number one problem is the following line:
for i in df_weather: This line is actually yielding you the column titles and not the rows themselves. What you're looking for is actually the following:
for i in df_weather.values():. The values will return a numpy array that you could itterate. The problem though is that the variable i will be a single row in the matrix now.
I have a DataFrame 'clicks' created by parsing CSV of size 1.4G. I'm trying to create a new column 'bought' using apply function.
clicks['bought'] = clicks['session'].apply(getBoughtItemIDs)
In getBoughtItemIDs, I'm checking if 'buys' dataframe has values I want, and if so, return a string concatenating them. The first line in getBoughtItemIDs is taking forever. What are the ways to make it faster?
def getBoughtItemIDs(val):
boughtSessions = buys[buys['session'] == val].values
output = ''
for row in boughtSessions:
output += str(row[1]) + ","
return output
There are a couple of things that make this code run slowly.
apply is essentially just syntactic sugar for a for loop over the rows of a column. There's also an explicit for loop over a NumPy array in your function (the for row in boughtSessions part). Looping in this (non-vectorised) way is best avoided whenever possible as it impacts performance heavily.
buys[buys['session'] == val].values is looking up val across an entire column for each row of clicks, then returning a sub-DataFrame and then creating a new NumPy array. Repeatedly looking for values in this way is expensive (O(n) complexity each lookup). Creating new arrays is going to be expensive since memory has to be allocated and the data copied across each time.
If I understand what you're trying to do, you could try the following approach to get your new column.
First use groupby to group the rows of buys by the values in 'session'. apply is used to join up the strings for each value:
boughtSessions = buys.groupby('session')[col_to_join].apply(lambda x: ','.join(x))
where col_to_join is the column from buys which contains the values you want to join together into a string.
groupby means that only one pass through the DataFrame is needed and is pretty well-optimised in Pandas. The use of apply to join the strings is unavoidable here, but only one pass through the grouped values is needed.
boughtSessions is now a Series of strings indexed by the unique values in the 'session' column. This is useful because lookups to Pandas indexes are O(1) in complexity.
To match each string in boughtSessions to the approach value in clicks['session'] you can use map. Unlike apply, map is fully vectorised and should be very fast:
clicks['bought'] = clicks['session'].map(boughtSessions)