I am a beginner at Python and dataframes, and I have encountered a problem. I have made a dataframe containing addresses. I want to create a calculated column that shows the distance from my house. I have gotten this far:
import googlemaps

API_key = 'Very secret api key'
gmaps = googlemaps.Client(key=API_key)

def distance(destination):
    origin = '44100, Kuhnamontie 1, Finland'
    distance = gmaps.distance_matrix(origin, destination, mode='driving')["rows"][0]["elements"][0]["distance"]["value"] / 1000
    return distance

df["distance"] = distance(df.Adress)
This solution works for the first row in the dataframe, but all rows in the column get the same value assigned. I guess the calculation and API request must be made per row. I guess I could loop through the dataframe, but as I understand it, there are better ways.
Can you help me?
Using pandas apply should be slightly more efficient than a loop:
df["distance"] = df.apply(lambda row: distance(row["Adress"]), axis=1)
I am currently working with an NFL dataset. The dataset is very granular but very straightforward: each row represents the position of a player (x,y coordinates relative to the field) for a given frame, for a given play, in a given game. You can think of a frame as a snapshot in time, where at that point we record the coordinates of all the players on the field and input them as one row per player in the dataframe. Each play has ~70 frames, each game has ~80 plays, and we have 250+ games.
What I want, is to identify, for certain players on offense (specifically wide receivers), who their closest defender is, and how far away they are. So ideally, I would apply a function that takes in a frame, and outputs two columns, populated only for the wide receivers - the two columns being closest defenders and their distance to the WR. I would use that function over all frames, within all plays, within all games.
I am struggling to come up with an efficient solution. I use the distance_matrix function from scipy to calculate the two metrics, but what's most difficult right now is iterating through the 1MM+ combinations of game, play, and frame.
I was thinking of maybe using the apply function to get to a result, but it would still involve iterating through the various combinations of game, play, and frame. I'm thinking maybe there's even a vectorized solution, but I can't come up with anything.
Any advice here would be immensely helpful. I've pasted my current working code below, which just uses for loops and takes a really long time.
temp = pd.DataFrame()

# For each game, and each play within the games, and frames within the play
for game_id in test.gameId.unique():
    for play_id in test[test.gameId==game_id].playId.unique():
        for frame_id in test[(test.playId==play_id)&(test.gameId==game_id)].frameId.unique():
            print("Game: {} | Play: {} | Frame: {}".format(game_id, play_id, frame_id))
            # Filter the dataframe on a given frame, within a given play, within a given game
            df = test[(test.gameId==game_id)&
                      (test.playId==play_id)&
                      (test.frameId==frame_id)]
            # Isolate the wide receivers
            df_wr = df[(df["inPoss"]==1)&(df['position']=="WR")]
            # Isolate the defenders
            df_d = df[df["inPoss"]==0]
            # Calculate the distance matrix between each WR and defenders
            dm = distance_matrix(df_wr[['x','y']].values,
                                 df_d[['x','y']].values)
            # use argmin and min to record the closest defender, and their distance
            closest_defender = dm.argmin(axis=1)
            closest_defender_distance = dm.min(axis=1)
            # Create a dataframe to record the information
            for i, j in enumerate(closest_defender):
                temp_df = pd.DataFrame({
                    'gameId': [game_id],
                    'playId': [play_id],
                    'frameId': [frame_id],
                    'displayName': [df_wr.displayName.iloc[i]],
                    'closestDefender': [df_d.displayName.iloc[j]],
                    'closestDefenderDistance': [closest_defender_distance[i]]
                })
                temp = pd.concat([temp, temp_df])
Obviously I don't have any data, so I can't test my code robustly. But there are some guiding principles that I can lay out.
You do not want to be doing so much subsetting. To avoid this, you can group by game, play, and frame:
for g, grouped_df in test.groupby(['gameId', 'playId', 'frameId']):
    ...  # do your isolation stuff here
This way you don't need to do the subsetting yourself, while still being able to reuse the code you already have. If you go this route, you shouldn't constantly be concatenating onto your existing data frame. Instead, collect the results in a list of data frames and concatenate once at the end, i.e.:
temp = []
for ... in ...:
    result_df = ...  # how you produce the result
    temp.append(result_df)
final = pd.concat(temp, axis='rows')
You could also reduce the whole thing to a function which you then apply over the groupby. The function would have the signature:
def complex_function(df):
    ...  # it can return multiple columns and rows as well

result = test.groupby(['gameId', 'playId', 'frameId']).apply(complex_function)
Returning a data frame in your groupby.apply here is somewhat tricky. The index of the returned data frame is broadcast onto your result index and may require resetting or flattening. The columns are broadcast properly, however.
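For this specific problem, a sketch of that approach (untested, since I don't have the data; it assumes the same column names you already use: gameId, playId, frameId, inPoss, position, x, y, displayName) could look like:
import pandas as pd
from scipy.spatial import distance_matrix

def closest_defenders(frame_df):
    # isolate the wide receivers and the defenders within one frame
    wr = frame_df[(frame_df["inPoss"] == 1) & (frame_df["position"] == "WR")]
    d = frame_df[frame_df["inPoss"] == 0]
    if wr.empty or d.empty:
        return pd.DataFrame()
    # pairwise distances: one row per WR, one column per defender
    dm = distance_matrix(wr[["x", "y"]].values, d[["x", "y"]].values)
    return pd.DataFrame({
        "displayName": wr["displayName"].values,
        "closestDefender": d["displayName"].values[dm.argmin(axis=1)],
        "closestDefenderDistance": dm.min(axis=1),
    })

result = (test.groupby(["gameId", "playId", "frameId"])
              .apply(closest_defenders)
              .droplevel(-1)          # drop the inner index added by apply
              .reset_index())         # turn the group keys back into columns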
I am trying to search through a pandas dataframe row by row and see if 3 variables are in the name of a file. If they are in the name of the file, more variables are extracted from that same row. For instance, I am checking to see if the concentration, substrate, and the number of droplets match the file name. If this condition is true, which will only happen once as there are no duplicates, I want to extract the frame rate and the time from that same row. Below is my code:
excel_var = 'Experiental Camera.xlsx'
workbook = pd.read_excel(excel_var, "PythonTable")
workbook.Concentration.astype(int, errors='raise')
for index, row in workbook.iterrows():
    if str(row['Concentration']) and str(row['substrate']) and str(-+row['droplets']) in path_ext:
        Actual_Frame_Rate = row['Actual Frame Rate']
        Acquired_Time = row['Acquisition time']
Attached is an example of what my spreadsheet looks like and what my Path_ext is.
At the moment nothing is being saved for Actual_Frame_Rate and I don't know why. I have attached the pictures to show that it should match. Is there anything wrong with my code, or is there a better way to go about this? Any help is much appreciated.
I'm unsure why this helped, but I fixed it by combining it all into one string and matching that. I used the following code:
for index, row in workbook.iterrows():
    match = 'water(' + str(row['Concentration']) + '%)-' + str(row['substrate']) + str(-+row['droplets'])
    # str(row['Concentration']) and str(row['substrate']) and str(-+row['droplets'])
    if match in path_ext:
        Actual_Frame_Rate = row['Actual Frame Rate']
        Acquired_Time = row['Acquisition time']
This code now produces the correct answer, but I'm still unsure why the other method doesn't work.
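The likely reason the original check misbehaves is operator precedence: str(a) and str(b) and str(c) in path_ext is evaluated as str(a) and str(b) and (str(c) in path_ext), so only the last value is actually tested against path_ext, while the first two are merely checked for truthiness. Testing each piece against the path explicitly would look something like this (a sketch using the same column names as above):
for index, row in workbook.iterrows():
    # each substring has to be tested against path_ext on its own
    if (str(row['Concentration']) in path_ext
            and str(row['substrate']) in path_ext
            and str(-+row['droplets']) in path_ext):
        Actual_Frame_Rate = row['Actual Frame Rate']
        Acquired_Time = row['Acquisition time']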
I have code that works with an Excel file (SAP download) quite extensively (data transformation and calculation steps).
I need to loop through all the lines (a couple thousand rows) a few times. I had previously written code that adds DataFrame columns separately, so I could do everything in one for loop, which was of course quite quick; however, I had to change the data source, which meant a change in the raw data structure.
The raw data structure has the first 3 rows empty, then a title row with column names, then 2 empty rows, and the 1st column is also empty. I decided to wipe these, assign column names, and make them headers (steps below); however, since then, separately adding column names and later calculating everything in one for statement does not fill data into any of these specific columns.
How could I optimize this code?
I have deleted some calculation steps since they are quite long and make the code even less readable.
#This function adds new column to the dataframe
def NewColdfConverter(*args):
    for i in args:
        dfConverter[i] = ''  #previously used dfConverter[i] = NaN

#This function creates dataframe from excel file
def DataFrameCreator(path, sheetname):
    excelFile = pd.ExcelFile(path)
    global readExcel
    readExcel = pd.read_excel(excelFile, sheet_name=sheetname)
#calling my function to create dataframe
DataFrameCreator(filePath,sheetName)
dfConverter = pd.DataFrame(readExcel)
#dropping NA values from Orders column (right now called Unnamed)
dfConverter.dropna(subset=['Unnamed: 1'], inplace=True)
#dropping rows and deleting other unnecessary columns
dfConverter.drop(dfConverter.head(1).index, inplace=True)
dfConverter.drop(dfConverter.columns[[0,11,12,13,17,22,23,48]], axis = 1,inplace = True)
#renaming columns from Unnamed 1: etc to proper names
dfConverter = dfConverter.rename(columns={'Unnamed: 1': 'propername1', 'Unnamed: 2': 'propername2'})  # etc.
#calling new column function -> this Day column appears in the 1st for loop
NewColdfConverter("Day")
#example for loop that worked prior, but not working since new dataset and new header/column steps added:
for i in range(len(dfConverter)):
    #Day column -> floor Entry Date -1, if time is less than 5:00:00
    if(dfConverter['Time'][i] <= time(hour=5,minute=0,second=0)):
        dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])-timedelta(days=1)
    else:
        dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])
The problem is that many columns build on one another, so I cannot compute them all in one for loop. For instance, in the example below I need to calculate reqsWoSetUpValue so I can calculate requirementsValue, so I can calculate otherReqsValue, but I'm not able to do this within one for loop by assigning the values to dataframecolumn[i] row by row, because the values will just be missing, as if nothing happened.
(dfSorted is the same as dfConverter, but a sorted version of it)
#example code of getting reqsWoSetUpValue
for i in range(len(dfSorted)):
    reqsWoSetUpValue[i] = #calculationsteps...

#inserting column with value
dfSorted.insert(49,'Reqs wo SetUp',reqsWoSetUpValue)

#getting requirements value with previously calculated Reqs wo SetUp column
for i in range(len(dfSorted)):
    requirementsValue[i] = #calc
dfSorted.insert(50,'Requirements',requirementsValue)

#Calculating Other Reqs value with previously calculated Requirements column.
for i in range(len(dfSorted)):
    otherReqsValue[i] = #calc
dfSorted.insert(51,'Other Reqs',otherReqsValue)
Does anyone have a clue why I cannot do this in 1 for loop anymore by first adding all the columns with the function, like:
NewColdfConverter('Reqs wo setup','Requirements','Other reqs')

#then in 1 for loop:
for i in range(len(dfSorted)):
    dfSorted['Reqs wo setup'] = #calculationsteps
    dfSorted['Requirements'] = #calculationsteps
    dfSorted['Other reqs'] = #calculationsteps
Thank you
General comment: How to identify bottlenecks
To get started, you should try to identify which parts of the code are slow.
Method 1: time code sections using the time package
Wrap blocks of code in statements like this:
import time
t = time.time()
# do something
print("time elapsed: {:.1f} seconds".format(time.time() - t))
Method 2: use a profiler
E.g. Spyder has a built-in profiler. This allows you to check which operations are most time consuming.
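Outside of Spyder, the standard-library cProfile module gives a similar breakdown. A minimal example (main() is just a placeholder for the code you want to profile):
import cProfile
import pstats

cProfile.run('main()', 'profile.out')  # profile a call and write the stats to a file
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)  # show the 10 most expensive calls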
Vectorize your operations
Your code will be orders of magnitude faster if you vectorize your operations. It looks like your loops are all avoidable.
For example, rather than calling pd.to_datetime on every row separately, you should call it on the entire column at once:
# slow (don't do this):
for i in range(len(dfConverter)):
dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])
# fast (do this instead):
dfConverter['Day'] = pd.to_datetime(dfConverter['Entry Date'])
If you want to perform an operation on a subset of rows, you can also do this in a vectorized operation by using loc:
mask = dfConverter['Time'] <= time(hour=5,minute=0,second=0)
dfConverter.loc[mask,'Day'] = pd.to_datetime(dfConverter.loc[mask,'Entry Date']) - timedelta(days=1)
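Combining both branches of the original loop (using the same Time and Entry Date columns as in the question), a fully vectorized version could look like this:
import numpy as np

entry = pd.to_datetime(dfConverter['Entry Date'])
mask = dfConverter['Time'] <= time(hour=5, minute=0, second=0)
# subtract one day where the time is 05:00 or earlier, otherwise keep the date as-is
dfConverter['Day'] = np.where(mask, entry - timedelta(days=1), entry)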
Not sure this would improve performance, but you could calculate the dependent columns at the same time, row by row, with DataFrame.iterrows():
for index, data in dfSorted.iterrows():
    dfSorted['Reqs wo setup'][index] = #calculationsteps
    dfSorted['Requirements'][index] = #calculationsteps
    dfSorted['Other reqs'][index] = #calculationsteps
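Note that chained indexing like dfSorted['Reqs wo setup'][index] = ... can trigger a SettingWithCopyWarning and may silently fail to write back; using .at (or .loc) per cell is the safer form:
for index, data in dfSorted.iterrows():
    # .at writes one cell directly on the original frame
    dfSorted.at[index, 'Reqs wo setup'] = ...   # calculation steps
    dfSorted.at[index, 'Requirements'] = ...    # calculation steps
    dfSorted.at[index, 'Other reqs'] = ...      # calculation steps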
My problem is simple. I have a pandas dataframe with 124957 different tweets (related to a central topic). The problem is that each date has more than one tweet (around 300 per day).
My goal is to perform sentiment analysis on the tweets of each day. In order to solve this, I am trying to combine all tweets of the same day into one string (which corresponds to each date).
To achieve this, I have tried the following:
indx = 0
get_tweet = ""

for i in range(0, len(cdata)-1):
    get_date = cdata.date.iloc[i]
    next_date = cdata.date.iloc[i+1]
    if(str(get_date)==str(next_date)):
        get_tweet = get_tweet + cdata.text.iloc[i] + " "
    if(str(get_date)!=str(next_date)):
        cdata.loc[indx,'date'] = get_date
        cdata.loc[indx,'text'] = get_tweet
        indx = indx + 1
        get_tweet = " "

df.to_csv("/home/development-pc/Documents/BTC_Tweets_1Y.csv")
My problem is that only a small sample of the data is actually converted to my format of choice.
Image of the dataframe
I do not know whether it is of importance, but the dataframe consists of three separate datasets that I combined into one using "pd.concat". After that, I sorted the newly created dataframe by date (ascending order) and reset the index as it was reversed (last input (2020-01-03) = 0 and first input (2019-01-01) = 124958).
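For reference, that combine/sort/reset step can be written in one chain (df1, df2, and df3 are just placeholder names here for the three source datasets):
cdata = pd.concat([df1, df2, df3]).sort_values('date', ascending=True).reset_index(drop=True)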
Thanks in advance,
Filippos
Without going into the loop you used (I think you are only concatenating the first two instances, but I'm not sure), you could use groupby and apply; here is an example:
# create some random data for example
import pandas as pd
import random
import string

dates = random.choices(pd.date_range(pd.Timestamp(2020,1,1), pd.Timestamp(2020,1,6)), k=11)
letters = string.ascii_lowercase
texts = [' '.join([''.join(random.choices(letters, k=random.randrange(2,10)))
                   for x in range(random.randrange(3,12))]) for x in range(11)]
df = pd.DataFrame({'date': dates, 'text': texts})

# group
pd.DataFrame(df.groupby('date').apply(lambda g: ' '.join(g['text'])))
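The same grouping can also be written directly on the text column, which keeps date as a regular column in the result:
daily = df.groupby('date')['text'].apply(' '.join).reset_index()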
I have a csv file "qwi_ak_se_fa_gc_ns_op_u.csv" which contains a lot of observations of 80 variables. One of them is geography, which is the county. Every county belongs to something called a Commuting Zone (CZ). Using a matching table given in "czmatcher.csv" I can assign a CZ to every county given in geography.
The code below shows my approach. It simply goes through every row and finds its CZ by going through the whole "czmatcher.csv" file and finding the right one. Then I proceed to just copy the values using .loc. The problem is, this took over 10 hours to run on a 0.5 GB file (2.5 million rows), which isn't that much, and my intuition says this should be faster.
This picture illustrates what the csv files look like. The idea would be to construct the "Wanted result (CZ)" column, name it CZ, and add it to the dataframe.
File example
import pandas as pd
data = pd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
czm = pd.read_csv("czmatcher.csv")
sLength = len(data['geography'])
data['CZ']=0
#this is just to fill the first value
for j in range(0,len(czm)):
    if data.loc[0,'geography']==czm.loc[0,'FIPS']:
        data.loc[0,'CZ'] = czm.loc[0,'CZID']

#now fill the rest
for i in range(1,sLength):
    if data.loc[i,'geography']==data.loc[i-1,'geography']:
        data.loc[i,'CZ'] = data.loc[i-1,'CZ']
    else:
        for j in range(0,len(czm)):
            if data.loc[i,'geography']==czm.loc[j,'FIPS']:
                data.loc[i,'CZ'] = czm.loc[j,'CZID']
Is there a faster way of doing this?
The best way to do this is a left merge on your dataframes:
data = pd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
czm = pd.read_csv("czmatcher.csv")
I assume that in both dataframes the column you merge on (country in this example) is spelled the same:
data_final = data.merge(czm, how='left', on='country')
If it isn't spelled the same way, you can rename your columns:
data.rename(columns={'col1': 'country'}, inplace=True)
Read the docs for further information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
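With the column names shown in the question (geography in the main data, FIPS and CZID in the matcher), the merge might look like this (a sketch, assuming the FIPS values line up exactly with the geography values):
data = pd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
czm = pd.read_csv("czmatcher.csv")

# left-merge the matcher onto the main data, keeping every original row
data = data.merge(czm[['FIPS', 'CZID']], how='left',
                  left_on='geography', right_on='FIPS')
data = data.drop(columns='FIPS').rename(columns={'CZID': 'CZ'})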
In order to make it faster without reworking your whole solution, I would recommend using Dask DataFrames. To put it simply, Dask reads your csv in chunks and processes each of them in parallel. After reading the csv, you can use the .compute method to get a pandas df instead of a Dask df.
This will look like this:
import pandas as pd
import dask.dataframe as dd  # IMPORT DASK DATAFRAMES
# YOU NEED TO USE dd.read_csv instead of pd.read_csv
data = dd.read_csv("qwi_ak_se_fa_gc_ns_op_u.csv")
data = data.compute()
czm = dd.read_csv("czmatcher.csv")
czm = czm.compute()
sLength = len(data['geography'])
data['CZ']=0
#this is just to fill the first value
for j in range(0,len(czm)):
    if data.loc[0,'geography']==czm.loc[0,'FIPS']:
        data.loc[0,'CZ'] = czm.loc[0,'CZID']

#now fill the rest
for i in range(1,sLength):
    if data.loc[i,'geography']==data.loc[i-1,'geography']:
        data.loc[i,'CZ'] = data.loc[i-1,'CZ']
    else:
        for j in range(0,len(czm)):
            if data.loc[i,'geography']==czm.loc[j,'FIPS']:
                data.loc[i,'CZ'] = czm.loc[j,'CZID']