How do I compare the two data frames and get the values? - python

The x data frame holds departure and arrival information, and the y data frame holds latitude and longitude data for each location.
I am trying to calculate the distance between the origin and destination using the latitude and longitude data of the start and end points (e.g., start_x, start_y, end_x, end_y).
How can I connect x and y so that the latitude/longitude data matching each code is brought into the x data frame?

The notation is somewhat confusing, but I kept the question's notation. One way to do this would be to merge your dataframes into a new one, like so:
Dummy dataframes:
import pandas as pd

# keeping the question's (admittedly confusing) reuse of the names x and y
x = [300, 500, 300, 600, 700]
y = [400, 400, 700, 700, 400]
code = [300, 400, 500, 600, 700]
start = [100, 101, 102, 103, 104]
end = [110, 111, 112, 113, 114]

x = pd.DataFrame({"x": x, "y": y})
y = pd.DataFrame({"code": code, "start": start, "end": end})
This gives:
x
     x    y
0  300  400
1  500  400
2  300  700
3  600  700
4  700  400

y
   code  start  end
0   300    100  110
1   400    101  111
2   500    102  112
3   600    103  113
4   700    104  114
Solution:
df = pd.merge(x,y,left_on="x",right_on="code").drop("code",axis=1)
df
     x    y  start  end
0  300  400    100  110
1  300  700    100  110
2  500  400    102  112
3  600  700    103  113
4  700  400    104  114
df = df.merge(y,left_on="y",right_on="code").drop("code",axis=1)
df
     x    y  start_x  end_x  start_y  end_y
0  300  400      100    110      101    111
1  500  400      102    112      101    111
2  700  400      104    114      101    111
3  300  700      100    110      104    114
4  600  700      103    113      104    114
Quick explanation:
The line df = pd.merge(...) creates the new dataframe by merging the left one (x) on its "x" column with the right one (y) on its "code" column. The second line, df = df.merge(...), takes the existing df as the left dataframe and uses its "y" column to merge against the "code" column of the y dataframe.
The .drop("code", axis=1) removes the unwanted "code" column produced by each merge.
The _x and _y suffixes are added automatically when merging dataframes that share column names. To control them, pass the suffixes=... option to the second merge (the one in which the identically named columns collide). In this case the defaults happen to produce the right names, so there is no need to bother with it, provided you use the x-derived df as the left dataframe and y as the right one.
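For illustration, a minimal sketch of passing the suffixes explicitly; ("_x", "_y") is just pandas' default spelled out:

df = pd.merge(x, y, left_on="x", right_on="code").drop("code", axis=1)
df = df.merge(y, left_on="y", right_on="code", suffixes=("_x", "_y")).drop("code", axis=1)

From here, the distance the question asks about becomes a row-wise computation. Assuming start and end hold the two coordinates of each location, so that (start_x, end_x) is the origin and (start_y, end_y) is the destination, a planar-distance sketch could look like this (real latitude/longitude would call for a haversine formula instead):

import numpy as np
df["distance"] = np.hypot(df["start_x"] - df["start_y"], df["end_x"] - df["end_y"])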

Related

How can I locate a subgroup of a dataframe based on more than one variable and replace a value only for that subgroup in the original dataframe?

I am quite new to Python and I have been having some trouble with the following:
I have a dataframe that I had to group by different variables in order to analyze the data.
   Package Package category Moisture  Length  Height  Packing weight
0      YYS                X  NON DRY    2000     200             200
1      XXS                Y  NON DRY     190      20             200
2      GGT                Z      DRY     350      32             680
3      YYS                X      DRY    1000     209             280
4      YYS                X      DRY    3500     209             280
5      GGT                Z      DRY     350      37             680
6      XXS                Y  NON DRY     345      29             600
7      GGT                Z      DRY     350      37             680
8      GGT                Z      DRY     350      37             680
9      YYS                X      DRY    2000     209             285
10     YYS                X  NON DRY    3400     200             200
11     YYS                X      DRY    2000     209             280
12     XXS                Y  NON DRY     190      23             200
13     XXS                Y  NON DRY     190      23             200
14     GGT                Z  NON DRY     190      23             200
15     XXS                Y  NON DRY     190      23             200
16     GGT                Z  NON DRY     190      23             200
17     XXS                Y  NON DRY     336      20             600
18     XXS                Y  NON DRY     190      23             200
For this analysis, I search for a specific group, using the following:
data1.loc[(data1['Package category'] == 'X') & (data1['Package'] == 'YYS') & (data1['Moisture'] == 'DRY')
& (data1['Length'] == 2000) & (data1['Height'] == 209.0),:]
Within that specific group I found that the values in the 'Packing weight' column vary, and I would like to have just one value, so I need to replace all rows of that group that have 280 as the Packing weight value with 285. So I am using this:
data1.loc[(data1['Package category'] == 'X') & (data1['Package'] == 'YYS') & (data1['Moisture'] == 'DRY')
& (data1['Length'] == 2000) & (data1['Height'] == 209.0),:].replace({280.0:285})
The problem is that I would like this replacement to be reflected in my original dataframe data1.
The code above displays the result as if the replacement had been made, but going through the original dataframe data1, the change has not actually been applied.
I have to do this analysis for several groups, and at the end I would like all of these changes applied to my one original dataframe data1.
Is there a way I can do this?
Edit: after reading this: Pandas how can 'replace' work after 'loc'?
I suggest the following edit.
Let's call the whole filtering condition con (just to keep things clearer here; substitute your full set of filtering conditions):
data1.loc[con, :] = data1.loc[con, :].replace({280.0: 285})
replace returns a new dataframe, so its result has to be assigned back for the change to stick.
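For completeness, a minimal sketch with the question's full condition written out (assuming the column names above); targeting the 'Packing weight' column directly also avoids rewriting the other columns:

con = ((data1['Package category'] == 'X') & (data1['Package'] == 'YYS')
       & (data1['Moisture'] == 'DRY') & (data1['Length'] == 2000)
       & (data1['Height'] == 209.0))
data1.loc[con, 'Packing weight'] = data1.loc[con, 'Packing weight'].replace({280.0: 285})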

Selecting a row interval according to a Column value in Pandas

Hi everyone, I have a dataset that looks like this:
transferid value type
5545 100 X
5123 40 A
5566 35 A
5675 700 X
5235 1100 A
5616 350 A
5772 170 X
It has its index for reference purposes, and what I would like to do is slice the dataset by rows, generating new datasets like these:
df1=
transferid value type
5545 100 X
5123 40 A
5566 35 A
5675 700 X
df2=
transferid value type
5675 700 X
5235 1100 A
5616 350 A
5772 170 X
including the boundary rows, as shown. Is there a way to do this in a single slicing pass? I tried gathering the indexes and using df.loc to set the slicing intervals, but I haven't had any success with that approach. The dataset can start with any transfer type, but I need to slice between every pair of consecutive type X transfers, and if no other type X is found before the end, slice through to the end.
Thanks for any help in advance
IIUC:
import numpy as np
import pandas as pd

# positions of the "X" rows, which act as interval boundaries
i = np.where(df.type == "X")[0]
# slice from each "X" up to and including the next one, keyed by interval number
pd.concat({j: df.iloc[x:y] for j, (x, y) in enumerate(zip(i, i[1:] + 1))})
     transferid  value type
0 0        5545    100    X
  1        5123     40    A
  2        5566     35    A
  3        5675    700    X
1 3        5675    700    X
  4        5235   1100    A
  5        5616    350    A
  6        5772    170    X
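The question also asks for a final slice running to the end when the data does not finish with a type X row; the sample data happens to end with one, so this variant is a sketch of one possible extension rather than part of the answer above:

i = np.where(df.type == "X")[0]
ends = list(i[1:] + 1)
if i[-1] != len(df) - 1:  # no closing "X": let the last slice run to the end
    ends.append(len(df))
pd.concat({j: df.iloc[x:y] for j, (x, y) in enumerate(zip(i, ends))})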

Pandas Relative Time Pivot

I have the last eight months of my customers' data; however, these are not the same calendar months, just the last months each customer happened to be with us. Monthly fees and penalties are stored in rows, but I want each of the last eight months to be a column.
What I have:
Customer Amount Penalties Month
123 500 200 1/7/2017
123 400 100 1/6/2017
...
213 300 150 1/4/2015
213 200 400 1/3/2015
What I want:
Customer Month-8-Amount Month-7-Amount ... Month-1-Amount Month-1-Penalties ...
123 500 400 450 300
213 900 250 300 200
...
What I've tried:
df = df.pivot(index=num, columns=[amount,penalties])
I got this error:
ValueError: all arrays must be same length
Is there some ideal way to do this?
You can do it with set_index and unstack:
# assuming the rows are already sorted with the most recent month first,
# number the months within each customer with cumcount
df['Month'] = df.groupby('Customer').cumcount() + 1
# keep only the most recent 8 months
df = df.loc[df.Month <= 8, :]
# unstack to reshape, turning each month number into a column level
s = df.set_index(['Customer', 'Month']).unstack().sort_index(level=1, axis=1)
# flatten the column MultiIndex into single strings
s.columns = s.columns.map('{0[0]}-{0[1]}'.format)
s.add_prefix("Month-")
Out[189]:
          Month-Amount-1  Month-Penalties-1  Month-Amount-2  Month-Penalties-2
Customer
123                  500                200             400                100
213                  300                150             200                400
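The cumcount step assumes the rows are already ordered newest-first within each customer. If 'Month' is still a raw date string, here is a sketch of enforcing that order beforehand (day-first parsing is an assumption about the sample's date format):

df['Month'] = pd.to_datetime(df['Month'], dayfirst=True)
df = df.sort_values(['Customer', 'Month'], ascending=[True, False])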

Pandas GroupBy with special sum

Let's say I have data like this, and I want to group it by feature and type.
feature type size
Alabama 1 100
Alabama 2 50
Alabama 3 40
Wyoming 1 180
Wyoming 2 150
Wyoming 3 56
When I apply df = df.groupby(['feature','type']).sum()[['size']], I get this, as expected:
size
(Alabama,1) 100
(Alabama,2) 50
(Alabama,3) 40
(Wyoming,1) 180
(Wyoming,2) 150
(Wyoming,3) 56
However, I want to sum the sizes over the same type only, not over both type and feature. While doing this I want to keep the indexes as (feature, type) tuples. I mean I want to get something like this:
size
(Alabama,1) 280
(Alabama,2) 200
(Alabama,3) 96
(Wyoming,1) 280
(Wyoming,2) 200
(Wyoming,3) 96
I am stuck trying to find a way to do this. I need some help, thanks.
Use set_index to build the MultiIndex, then transform with 'sum', which returns a Series of the same length as the original, filled with the per-type aggregate:
df = df.set_index(['feature','type'])
df['size'] = df.groupby(['type'])['size'].transform('sum')
print (df)
              size
feature type
Alabama 1      280
        2      200
        3       96
Wyoming 1      280
        2      200
        3       96
EDIT: first aggregate by both columns, then use transform:
df = df.groupby(['feature','type']).sum()
df['size'] = df.groupby(['type'])['size'].transform('sum')
print (df)
              size
feature type
Alabama 1      280
        2      200
        3       96
Wyoming 1      280
        2      200
        3       96
Here is one way:
df['size_type'] = df['type'].map(df.groupby('type')['size'].sum())
df.groupby(['feature', 'type'])['size_type'].sum()
# feature  type
# Alabama  1       280
#          2       200
#          3        96
# Wyoming  1       280
#          2       200
#          3        96
# Name: size_type, dtype: int64
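If literal (feature, type) tuples are wanted as a flat index, exactly as written in the question's expected output, a small follow-up sketch (to_flat_index requires pandas 0.24+) converts the MultiIndex afterwards:

df.index = df.index.to_flat_index()  # e.g. ('Alabama', 1), ('Alabama', 2), ...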

Join dataframe with matrix output using pandas

I am trying to translate the input dataframe (inp_df) to the output dataframe (out_df) using the data from the cell-based intermediate dataframe (matrix_df), as shown below.
There are several cell-number-based files with distance values, shown as matrix_df.
The program iterates by cell and fetches data from the appropriate file, so each time matrix_df holds the data for all rows of the current cell# being iterated over in inp_df.
inp_df
A B cell
100 200 1
115 270 1
145 255 2
115 266 1
matrix_df (cell_1.csv)
B 100 115 199 avg_distance
200 7.5 80.7 67.8 52
270 6.8 53 92 50
266 58 84 31 57
matrix_df (cell_2.csv)
B 145 121 166 avg_distance
255 74.9 77.53 8 53.47
out_df dataframe
A B cell distance avg_distance
100 200 1 7.5 52
115 270 1 53 50
145 255 2 74.9 53.47
115 266 1 84 57
My current thought process for each cell# of data is:
use an apply function to go row by row
then use a join on column B between inp_df and matrix_df, where the matrix df is somehow translated into a tuple of column name, distance and average distance.
But I am looking for an idiomatic pandas way of doing this, since my approach will slow down when there are millions of rows in the input. I am specifically looking for the core logic inside an iteration to fetch the matches, since the number of columns in matrix_df varies from cell to cell.
If it's any help, the matrix files are the distance outputs from sklearn.metrics.pairwise.pairwise_distances.
NB: in inp_df the values of column B are unique, and the values of column A may or may not be unique.
Also, matrix_df's first column header was empty, and I renamed it with the following code for ease of understanding, since it was a header-less matrix output file:
dist_df = pd.read_csv(mypath, index_col=False)
dist_df.rename(columns={'Unnamed: 0': 'B'}, inplace=True)
Step 1: Concatenate your inputs with pd.concat and merge with inp_df using df.merge
In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
Step 2: Create the distance column with df.apply by using A's values to index into the correct column
In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
[['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]:
     A    B  cell  distance  avg_distance
0  100  200     1       7.5         52.00
1  115  270     1      53.0         50.00
2  115  266     1      84.0         57.00
3  145  255     2      74.9         53.47
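Put together as a runnable sketch, with small dummy frames standing in for the cell_1.csv / cell_2.csv contents (values copied from the question; the names matrix_df1 and matrix_df2 are assumptions):

import pandas as pd

inp_df = pd.DataFrame({'A': [100, 115, 145, 115],
                       'B': [200, 270, 255, 266],
                       'cell': [1, 1, 2, 1]})
matrix_df1 = pd.DataFrame({'B': [200, 270, 266],
                           '100': [7.5, 6.8, 58.0],
                           '115': [80.7, 53.0, 84.0],
                           '199': [67.8, 92.0, 31.0],
                           'avg_distance': [52.0, 50.0, 57.0]})
matrix_df2 = pd.DataFrame({'B': [255], '145': [74.9], '121': [77.53],
                           '166': [8.0], 'avg_distance': [53.47]})

# concatenate the per-cell matrices, merge on the shared column B,
# then look up each row's distance in the column named after its A value
out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
out_df = out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))
print(out_df[['A', 'B', 'cell', 'distance', 'avg_distance']])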
