Converting degree minute second (DMS) columns to decimal degrees in pandas dataframe - python

I have a data frame like the one below and would like to convert the Latitude and Longitude columns from degree/minute/second (DMS) format into decimal degrees, keeping the other columns in the updated table.
Any help would be appreciated.

Here is the relevant code. It uses apply with a lambda to process each row of the dataframe and creates a new column, lat_decimal, to hold the result.
import pandas as pd

# Create dataframe
d6 = {'id': ['a1', 'a2', 'a3'],
      'lat_deg': [10, 11, 12],
      'lat_min': [15, 30, 45],
      'lat_sec': [10, 20, 30]}
df6 = pd.DataFrame(data=d6)

# Convert each row's degrees/minutes/seconds to decimal degrees
df6["lat_decimal"] = df6[["lat_deg", "lat_min", "lat_sec"]].apply(
    lambda row: row.values[0] + row.values[1] / 60 + row.values[2] / 3600, axis=1)
The resulting dataframe:
id lat_deg lat_min lat_sec lat_decimal
0 a1 10 15 10 10.252778
1 a2 11 30 20 11.505556
2 a3 12 45 30 12.758333
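The same result can be computed without apply, using vectorized column arithmetic; a minimal sketch on the df6 frame above, which is usually much faster on large frames:
# Vectorized alternative: operate on whole columns at once
df6["lat_decimal"] = df6["lat_deg"] + df6["lat_min"] / 60 + df6["lat_sec"] / 3600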

Finding the summation of values from two pandas dataframe column

I have a pandas dataframe like below
import pandas as pd
data = [[5, 10], [4, 20], [15, 30], [20, 15], [12, 14], [5, 5]]
df = pd.DataFrame(data, columns=['x', 'y'])
I am trying to compute the value of this expression: 0.5 * sum(x[i] * y[i+1] - x[i+1] * y[i]).
I haven't got an idea how to multiply the first value in a column with the second value in another column, as the expression requires.
Try pd.DataFrame.shift(), but I think you need to pass -1 to shift, judging by the summation notation you posted. i + 1 implies using the next x or y, so shift needs a negative integer to shift one position ahead; positive integers in shift go backwards.
Can you confirm -202.5 is the right answer?
0.5 * ((df.x * df.y.shift(-1)) - (df.x.shift(-1) * df.y)).sum()
>>> -202.5
I think the code below produces the correct value in expression_end:
import pandas as pd

data = [[5, 10], [4, 20], [15, 30], [20, 15], [12, 14], [5, 5]]
df = pd.DataFrame(data, columns=['x', 'y'])
# Shift x and y one row ahead so each row sees its successor
df["x+1"] = df["x"].shift(periods=-1)
df["y+1"] = df["y"].shift(periods=-1)
df["exp"] = df["x"] * df["y+1"] - df["x+1"] * df["y"]
expression_end = 0.5 * df["exp"].sum()
You can use pandas.DataFrame.shift(). You only need to compute shift(-1) once and can reuse the result for both 'x' and 'y'.
>>> df_tmp = df.shift(-1)
>>> (df['x']*df_tmp['y'] - df_tmp['x']*df['y']).sum() * 0.5
-202.5
# Explanation
>>> df[['x+1', 'y+1']] = df.shift(-1)
>>> df
x y x+1 y+1
0 5 10 4.0 20.0 # x*(y+1) - y*(x+1) = 5*20 - 10*4
1 4 20 15.0 30.0
2 15 30 20.0 15.0
3 20 15 12.0 14.0
4 12 14 5.0 5.0
5 5 5 NaN NaN
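One detail worth noting: the shifted columns end in NaN (row 5 above), and those rows drop out of the total because pandas' .sum() skips NaN by default. A quick check of that behaviour:
import numpy as np
import pandas as pd
s = pd.Series([1.0, 2.0, np.nan])
print(s.sum())               # 3.0 - NaN skipped by default
print(s.sum(skipna=False))   # nan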

pandas replace all values of a column with a column values that increment by n starting at 0

Say I have a dataframe like the one below, which I have read in from a file (note: *.ene is a txt file):
df = pd.read_fwf('filename.ene')
TS DENSITY STATS
1
2
3
1
2
3
I would like to only change the TS column. I wish to replace all the column values of 'TS' with the values from range(0,751,125). The desired output should look like so:
TS DENSITY STATS
0
125
250
500
625
750
I'm a bit lost and would like some insight regarding the code to do such a thing in a general format.
I used a for loop to store the values into a list:
K = (6 * 125) + 1
m = []
for i in range(0, K, 125):
    m.append(i)
I thought to use .replace like so:
df['TS']=df['TS'].replace(old_value, m, inplace=True)
but I was not sure what to put in place of old_value to select all the values of the 'TS' column, or whether this would even work as a method.
It's pretty straightforward: if you're replacing all the data in the column, you just need to do
df['TS'] = m
example :
import pandas as pd
data = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
df = pd.DataFrame(data, index=[0, 1, 2], columns=['a', 'b', 'c'])
print(df)
# a b c
# 0 10 20 30
# 1 40 50 60
# 2 70 80 90
df['a'] = [1,2,3]
print(df)
# a b c
# 0 1 20 30
# 1 2 50 60
# 2 3 80 90
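Applied to the TS question, the one catch is that the replacement list must be exactly as long as the column, otherwise pandas raises a ValueError. A sketch, assuming the frame read from filename.ene has six rows (note that range(0, 751, 125) produces seven values, so it would not fit as-is):
# Build the replacement values from the row count so lengths always match
step = 125
df['TS'] = [i * step for i in range(len(df))]   # six rows -> 0, 125, ..., 625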

Fastest way to calculate in Pandas?

Given these two dataframes:
df1 =
Name Start End
0 A 10 20
1 B 20 30
2 C 30 40
df2 =
0 1
0 5 10
1 15 20
2 25 30
df2 has no column names, but you can assume column 0 is an offset of df1.Start and column 1 is an offset of df1.End. I would like to transpose df2 onto df1 to get the Start and End differences. The final df1 dataframe should look like this:
Name Start End Start_Diff_0 End_Diff_0 Start_Diff_1 End_Diff_1 Start_Diff_2 End_Diff_2
0 A 10 20 5 10 -5 0 -15 -10
1 B 20 30 15 20 5 10 -5 0
2 C 30 40 25 30 15 20 5 10
I have a solution that works, but I'm not satisfied with it because it takes too long to run when processing a dataframe that has millions of rows. Below is a sample test case to simulate processing 30,000 rows. As you can imagine, running the original solution (method_1) on a 1GB dataframe is going to be a problem. Is there a faster way to do this using Pandas, Numpy, or maybe another package?
UPDATE: I've added the provided solutions to the benchmarks.
# Import required modules
import numpy as np
import pandas as pd
import timeit

# Original
def method_1():
    df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]] * 10000, columns=['Name', 'Start', 'End'])
    df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
    # Store data for new columns in a dictionary
    new_columns = {}
    for index1, row1 in df1.iterrows():
        for index2, row2 in df2.iterrows():
            key_start = 'Start_Diff_' + str(index2)
            key_end = 'End_Diff_' + str(index2)
            if key_start in new_columns:
                new_columns[key_start].append(row1[1] - row2[0])
            else:
                new_columns[key_start] = [row1[1] - row2[0]]
            if key_end in new_columns:
                new_columns[key_end].append(row1[2] - row2[1])
            else:
                new_columns[key_end] = [row1[2] - row2[1]]
    # Add dictionary data as new columns
    for key, value in new_columns.items():
        df1[key] = value

# jezrael - https://stackoverflow.com/a/60843750/452587
def method_2():
    df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]] * 10000, columns=['Name', 'Start', 'End'])
    df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
    # Convert selected columns to 2d numpy array
    a = df1[['Start', 'End']].to_numpy()
    b = df2[[0, 1]].to_numpy()
    # Output is 3d array; convert it to 2d array
    c = (a - b[:, None]).swapaxes(0, 1).reshape(a.shape[0], -1)
    # Generate column names and add to the original with DataFrame.join
    cols = [item for x in range(b.shape[0]) for item in (f'Start_Diff_{x}', f'End_Diff_{x}')]
    df1 = df1.join(pd.DataFrame(c, columns=cols, index=df1.index))

# sammywemmy - https://stackoverflow.com/a/60844078/452587
def method_3():
    df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]] * 10000, columns=['Name', 'Start', 'End'])
    df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
    # Create numpy arrays of df1 and df2
    df1_start = df1.loc[:, 'Start'].to_numpy()
    df1_end = df1.loc[:, 'End'].to_numpy()
    df2_start = df2[0].to_numpy()
    df2_end = df2[1].to_numpy()
    # Use np.tile to create shapes that allow elementwise subtraction
    tiled_start = np.tile(df1_start, (len(df2), 1)).T
    tiled_end = np.tile(df1_end, (len(df2), 1)).T
    # Subtract df2 from df1
    start = np.subtract(tiled_start, df2_start)
    end = np.subtract(tiled_end, df2_end)
    # Create columns for start and end
    start_columns = [f'Start_Diff_{num}' for num in range(len(df2))]
    end_columns = [f'End_Diff_{num}' for num in range(len(df2))]
    # Create dataframes of start and end
    start_df = pd.DataFrame(start, columns=start_columns)
    end_df = pd.DataFrame(end, columns=end_columns)
    # Lump start and end into one dataframe
    lump = pd.concat([start_df, end_df], axis=1)
    # Sort the columns by the digits at the end
    filtered = lump.columns[lump.columns.str.contains(r'\d')]
    cols = sorted(filtered, key=lambda x: x[-1])
    lump = lump.reindex(cols, axis='columns')
    # Hook lump back to df1
    df1 = pd.concat([df1, lump], axis=1)

print('Method 1:', timeit.timeit(method_1, number=3))
print('Method 2:', timeit.timeit(method_2, number=3))
print('Method 3:', timeit.timeit(method_3, number=3))
Output:
Method 1: 50.506279182
Method 2: 0.08886280600000163
Method 3: 0.10297686199999845
I suggest using numpy here - convert the selected columns to a 2d numpy array in the first step:
a = df1[['Start','End']].to_numpy()
b = df2[[0,1]].to_numpy()
Output is 3d array, convert it to 2d array:
c = (a - b[:, None]).swapaxes(0,1).reshape(a.shape[0],-1)
print (c)
[[ 5 10 -5 0 -15 -10]
[ 15 20 5 10 -5 0]
[ 25 30 15 20 5 10]]
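The subtraction works by NumPy broadcasting; a shape check on a small sketch makes the intermediate dimensions explicit:
import numpy as np
a = np.array([[10, 20], [20, 30], [30, 40]])   # df1[['Start', 'End']]
b = np.array([[5, 10], [15, 20], [25, 30]])    # df2[[0, 1]]
print(b[:, None].shape)          # (3, 1, 2) - extra axis for broadcasting
print((a - b[:, None]).shape)    # (3, 3, 2) - one 2d block per df2 row
print((a - b[:, None]).swapaxes(0, 1).reshape(a.shape[0], -1).shape)   # (3, 6)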
Last, generate the column names and add them to the original with DataFrame.join:
cols = [item for x in range(b.shape[0]) for item in (f'Start_Diff_{x}', f'End_Diff_{x}')]
df = df1.join(pd.DataFrame(c, columns=cols, index=df1.index))
print (df)
Name Start End Start_Diff_0 End_Diff_0 Start_Diff_1 End_Diff_1 \
0 A 10 20 5 10 -5 0
1 B 20 30 15 20 5 10
2 C 30 40 25 30 15 20
Start_Diff_2 End_Diff_2
0 -15 -10
1 -5 0
2 5 10
Don't use iterrows(). If you're simply subtracting values, use vectorization with Numpy (Pandas also offers vectorization, but Numpy is faster).
For instance:
df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
col_names = "Start_Diff_1 End_Diff_1".split()
df3 = pd.DataFrame(df2.to_numpy() - 10, columns=col_names)
Here df3 equals:
Start_Diff_1 End_Diff_1
0 -5 0
1 5 10
2 15 20
You can also change column names by doing:
df2.columns = "Start_Diff_0 End_Diff_0".split()
You can use f-strings to generate the column names in a loop, i.e., f"Start_Diff_{i}", where i is the loop index.
You can also combine multiple dataframes with:
df = pd.concat([df1, df2],axis=1)
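Putting those pieces together, a sketch of the complete loop (the variable names here are illustrative, not from the original code):
import pandas as pd
df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]], columns=['Name', 'Start', 'End'])
df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]])
parts = [df1]
for i, (s, e) in enumerate(df2.to_numpy()):
    # f-strings generate the numbered column names for each df2 row
    parts.append(pd.DataFrame({f'Start_Diff_{i}': df1['Start'] - s,
                               f'End_Diff_{i}': df1['End'] - e}))
df = pd.concat(parts, axis=1)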
This is one way to go about it:
#create numpy arrays of df1 and 2
df1_start = df1.loc[:,'Start'].to_numpy()
df1_end = df1.loc[:,'End'].to_numpy()
df2_start = df2[0].to_numpy()
df2_end = df2[1].to_numpy()
#use np tile to create shapes
#that allow element wise subtraction
tiled_start = np.tile(df1_start,(len(df2),1)).T
tiled_end = np.tile(df1_end,(len(df2),1)).T
#subtract df2 from df1
start = np.subtract(tiled_start,df2_start)
end = np.subtract(tiled_end, df2_end)
#create columns for start and end
start_columns = [f'Start_Diff_{num}' for num in range(len(df2))]
end_columns = [f'End_Diff_{num}' for num in range(len(df2))]
#create dataframes of start and end
start_df = pd.DataFrame(start,columns=start_columns)
end_df = pd.DataFrame(end, columns = end_columns)
#lump start and end into one dataframe
lump = pd.concat([start_df,end_df],axis=1)
#sort the columns by the digits at the end
filtered = lump.columns[lump.columns.str.contains(r'\d')]
cols = sorted(filtered, key=lambda x: x[-1])
lump = lump.reindex(cols,axis='columns')
#hook lump back to df1
final = pd.concat([df1,lump],axis=1)
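One caveat on the column sort: key=lambda x: x[-1] compares only the last character, which works while the trailing index is a single digit. A sketch that sorts on the full trailing number instead:
import re
cols = sorted(filtered, key=lambda name: int(re.search(r'\d+$', name).group()))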

Finding duplicates in a column, setting conditions, summing values from another column

I have a csv file and I'm currently using the pandas module. I have not found a solution to my problem. Below are the sample csv, the problem, and the desired output.
Sample csv:
project, id, sec, code
1, 25, 50, 01
1, 25, 50, 12
1, 25, 45, 07
1, 5, 25, 03
1, 25, 20, 06
Problem:
I do not want to get rid of duplicate (id) rows, but rather sum the (sec) values into the code 01 row when duplicates are found alongside other codes such as 12, 7, and 6. I also need to know how to set conditions: if a code 7 row's sec is less than 60, do not include it in the sum. I have used the following code to sort by columns; the .isin, however, gets rid of "id" 5. In a larger file there will be other duplicate "id"s with similar codes.
df = df.sort_values(by=['id'], ascending=[True])
df2 = df.copy()
sort1 = df2[df2['code'].isin(['01', '07', '06', '12'])]
Desired Output:
project, id, sec, code
1, 5, 25, 03
1, 25, 120, 01
1, 25, 50, 12
1, 25, 45, 07
1, 25, 20, 06
I have thought of parsing through the file but I'm stuck on the logic.
def edit_data(df):
    sum = 0
    with open(df) as file:
        next(file)
        for line in file:
            parts = line.split(',')
            code = float(parts[3])
            id = float(parts[1])
            sec = float(parts[2])
    return ?
Appreciate any help as I'm new in Python equivalent to 3 months experience. Thanks!
Let's try this:
import numpy as np

df = df.sort_values('id')
# Use boolean indexing to eliminate unwanted records, then groupby and sum;
# convert the result to a dataframe indexed by the groups.
sumdf = df[~((df.code == 7) & (df.sec < 60))].groupby(['project', 'id'])['sec'].sum().to_frame()
# Find the first record of each group using duplicated, and with boolean
# indexing again set the sec column for those records to NaN.
df.loc[~df.duplicated(subset=['project', 'id']), 'sec'] = np.nan
# Set the index of the original dataframe and use combine_first to replace
# those NaN with values from the summed, grouped dataframe.
df_out = df.set_index(['project', 'id']).combine_first(sumdf).reset_index().astype(int)
df_out
df_out
Output:
project id code sec
0 1 5 3 25
1 1 25 1 120
2 1 25 12 50
3 1 25 7 45
4 1 25 6 20
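An alternative sketch of the same logic using groupby.transform, assuming the code column was parsed as integers (so code 01 compares equal to 1):
import pandas as pd
df = pd.DataFrame({'project': [1, 1, 1, 1, 1],
                   'id': [25, 25, 25, 5, 25],
                   'sec': [50, 50, 45, 25, 20],
                   'code': [1, 12, 7, 3, 6]})
# Rows eligible for the total: exclude code 7 rows with sec below 60
eligible = ~((df['code'] == 7) & (df['sec'] < 60))
totals = df[eligible].groupby(['project', 'id'])['sec'].transform('sum')
# Write each group's total onto its code 01 row only
df.loc[df['code'] == 1, 'sec'] = totals   # id 25's code 01 row becomes 120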

new column containing header of another column based on conditionals

I'm relatively new to python and I feel this is a complex task
From dfa:
I'm trying to return the smallest and second-smallest values from a range of columns (dist1 through dist5), along with the name of the column each value came from (i.e. "dist3"), placing this information into four new columns. A given distX column will have a mix of numbers and NaN, either as the string 'NaN' or as np.nan.
dfa = pd.DataFrame({'date': ['09-03-1988', '10-03-1988', '11-03-1988', '12-03-1988', '13-03-1988'],
'dist1': ['NaN',2,'NaN','NaN', 30],
'dist2': [20, 21, 22, 23, 'NaN'],
'dist3': [120, 'NaN', 122, 123, 11],
'dist4': [40, 'NaN', 42, 43, 'NaN'],
'dist5': ['NaN',1,'NaN','NaN', 70]})
Task 1) I want to add two new columns "fir_closest" and "fir_closest_dist".
fir_closest_dist should contain the smallest value from columns dist1 through to dist5 (i.e. 20 for row 1, 11 for row 5).
fir_closest should contain the name of the column the value in fir_closest_dist came from (i.e. "dist2" for the first row).
Task 2) Repeat the above but for the second/next smallest value to create two new columns "sec_closest" and "sec_closest_dist"
Output table needs to look like dfb
dfb = pd.DataFrame({'date': ['09-03-1988', '10-03-1988', '11-03-1988', '12-03-1988', '13-03-1988'],
'dist1': ['NaN',2,'NaN','NaN', 30],
'dist2': [20, 21, 22, 23, 'NaN'],
'dist3': [120, 'NaN', 122, 123, 11],
'dist4': [40, 'NaN', 42, 43, 'NaN'],
'dist5': ['NaN',1,'NaN','NaN', 70],
'fir_closest': ['dist2','dist5','dist2','dist2', 'dist3'],
'fir_closest_dist': [20,1,22,23,11],
'sec_closest': ['dist4','dist1','dist4','dist4', 'dist1'],
'sec_closest_dist': [40,2,42,43,30]})
Please can you show code or explain how best to approach this. What is the name for this method of populating new columns?
Thanks in advance
I think this may do what you need.
import pandas as pd
import numpy as np
#Reproducibility and data generation for example
np.random.seed(0)
X = np.random.randint(low = 0, high = 10, size = (5,5))
#Your data
df = pd.DataFrame(X, columns = [f'dist{j}' for j in range(5)])
# Row indices (the example happens to be square, but shape[0] is the correct axis)
ix = range(df.shape[0])
col_names = df.columns.values
# Capture the values before new columns are added, then find the
# argument (column position) of the kth smallest value per row
vals = df.values
arg_row_min, arg_row_min2, *rest = np.argsort(vals, axis=1).T
df['dist_min'] = col_names[arg_row_min]
df['num_min'] = vals[ix, arg_row_min]
df['dist_min2'] = col_names[arg_row_min2]
df['num_min2'] = vals[ix, arg_row_min2]
Assuming your DataFrame is named df, and you have run import pandas as pd and import numpy as np:
# Example data
df = pd.DataFrame({'date': pd.date_range('2017-04-15', periods=5),
                   'name': ['Mullion'] * 5,
                   'dist1': [np.nan, np.nan, 30, 20, 15],
                   'dist2': [40, 30, 20, 15, 16],
                   'dist3': [101, 100, 98, 72, 11]})
df
df
date dist1 dist2 dist3 name
0 2017-04-15 NaN 40 101 Mullion
1 2017-04-16 NaN 30 100 Mullion
2 2017-04-17 30.0 20 98 Mullion
3 2017-04-18 20.0 15 72 Mullion
4 2017-04-19 15.0 16 11 Mullion
# Select only those columns with numeric data types. In your case, this is
# the same as:
# df_num = df[['dist1', 'dist2', ...]].copy()
df_num = df.select_dtypes(np.number)
# Get the column index of each row's minimum distance. First, fill NaN with
# numpy's infinity placeholder to ensure that NaN distances are never chosen.
idxs = df_num.fillna(np.inf).values.argsort(axis=1)
# The 1st column of idxs (which is idxs[:, 0]) contains the column index of
# each row's smallest distance.
# The 2nd column of idxs (which is idxs[:, 1]) contains the column index of
# each row's second-smallest distance.
# Convert the index of each row's closest distance to a column name.
# (df.columns is a list-like that holds the column names of df.)
df['closest_name'] = df_num.columns[idxs[:, 0]]
# Now get the distances themselves by indexing the underlying numpy array
# of values. There may be a more pandas-specific way of doing this, but
# this should be very fast.
df['closest_dist'] = df_num.values[np.arange(len(df_num)), idxs[:, 0]]
# Same idea for the second-closest distances.
df['second_closest_name'] = df_num.columns[idxs[:, 1]]
df['second_closest_dist'] = df_num.values[np.arange(len(df_num)), idxs[:, 1]]
df
date dist1 dist2 dist3 name closest_name closest_dist \
0 2017-04-15 NaN 40 101 Mullion dist2 40.0
1 2017-04-16 NaN 30 100 Mullion dist2 30.0
2 2017-04-17 30.0 20 98 Mullion dist2 20.0
3 2017-04-18 20.0 15 72 Mullion dist2 15.0
4 2017-04-19 15.0 16 11 Mullion dist3 11.0
second_closest_name second_closest_dist
0 dist3 101.0
1 dist3 100.0
2 dist1 30.0
3 dist1 20.0
4 dist1 15.0
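Applied back to dfa from the question, the string 'NaN' entries must first be coerced to real NaN; a sketch using pd.to_numeric with errors='coerce':
import numpy as np
import pandas as pd
num = dfa.filter(like='dist').apply(pd.to_numeric, errors='coerce')
idxs = num.fillna(np.inf).values.argsort(axis=1)
rows = np.arange(len(num))
dfa['fir_closest'] = num.columns[idxs[:, 0]]
dfa['fir_closest_dist'] = num.values[rows, idxs[:, 0]]
dfa['sec_closest'] = num.columns[idxs[:, 1]]
dfa['sec_closest_dist'] = num.values[rows, idxs[:, 1]]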
