I am a beginner working with a clinical data set using Pandas in Jupyter Notebook.
A column of my data contains census tract codes and I am trying to merge my data with a large transportation data file that also has a column with census tract codes.
I initially wanted only 2 of the columns from that transportation file, so after downloading it I removed every column except those 2 and the census tract column.
This is the code I used:
import pandas as pd

df_my_data = pd.read_excel("my_data.xlsx")
df_transportation_data = pd.read_excel("transportation_data.xlsx")
df_merged_file = pd.merge(df_my_data, df_transportation_data)
df_merged_file.to_excel('my_merged_file.xlsx', index=False)
This worked, but then I wanted to add the remaining columns from the transportation file, so I went back to my initial file (prior to adding the 2 transportation columns) and tried to merge the entire transportation file. This resulted in a new DataFrame with all of the desired columns but only 4 rows.
I thought maybe the transportation file was too big, so I tried merging individual columns (other than the 2 I was initially able to merge), and this again resulted in all of the correct columns but only 4 rows.
Any help would be much appreciated.
Edits:
Sorry for not being more clear.
Here is the code for the 2 initial columns I merged:
import pandas as pd
df_my_data = pd.read_excel('my_data.xlsx')
df_two_columns = pd.read_excel('two_columns_from_transportation_file.xlsx')
df_two_columns_merged = pd.merge(df_my_data, df_two_columns, on=['census_tract'])
df_two_columns_merged.to_excel('two_columns_merged.xlsx', index=False)
The outputs were:
df_my_data.head()
census_tract id e t
0 6037408401 1 1 1092
1 6037700200 2 1 1517
2 6065042740 3 1 2796
3 6037231210 4 1 1
4 6059076201 5 1 41
df_two_columns.head()
census_tract households_with_no_vehicle vehicles_per_household
0 6001400100 2.16 2.08
1 6001400200 6.90 1.50
2 6001400300 17.33 1.38
3 6001400400 8.97 1.41
4 6001400500 11.59 1.39
df_two_columns_merged.head()
census_tract id e t households_with_no_vehicle vehicles_per_household
0 6037408401 1 1 1092 4.52 2.43
1 6037700200 2 1 1517 9.88 1.26
2 6065042740 3 1 2796 2.71 1.49
3 6037231210 4 1 1 25.75 1.35
4 6059076201 5 1 41 1.63 2.22
df_my_data has 657 rows and df_two_columns_merged came out with 657 rows.
The code for when I tried to merge the entire transportation file:
import pandas as pd
df_my_data = pd.read_excel('my_data.xlsx')
df_transportation_data = pd.read_excel('transportation_data.xlsx')
df_merged_file = pd.merge(df_my_data, df_transportation_data, on=['census_tract'])
df_merged_file.to_excel('my_merged_file.xlsx', index=False)
The output:
df_transportation_data.head()
census_tract Bike Carpooled Drove Alone Households No Vehicle Public Transportation Walk Vehicles per Household
0 6001400100 0.00 12.60 65.95 2.16 20.69 0.76 2.08
1 6001400200 5.68 3.66 45.79 6.90 39.01 5.22 1.50
2 6001400300 7.55 6.61 46.77 17.33 31.19 6.39 1.38
3 6001400400 8.85 11.29 43.91 8.97 27.67 4.33 1.41
4 6001400500 8.45 7.45 46.94 11.59 29.56 4.49 1.39
df_merged_file.head()
census_tract id e t Bike Carpooled Drove Alone Households No Vehicle Public Transportation Walk Vehicles per Household
0 6041119100 18 0 2755 1.71 3.02 82.12 4.78 8.96 3.32 2.10
1 6061023100 74 1 1201 0.00 9.85 86.01 0.50 2.43 1.16 2.22
2 6041110100 80 1 9 0.30 4.40 72.89 6.47 13.15 7.89 1.82
3 6029004902 123 0 1873 0.00 18.38 78.69 4.12 0.00 0.00 2.40
The df_merged_file only has 4 total rows.
So my question is: why am I able to merge those initial 2 columns from the transportation file and keep all of the rows from my file, but when I try to merge the entire transportation file I only get 4 rows of output?
I recommend specifying the merge type and the merge column(s) explicitly.
When you use pd.merge(), the default merge type is an inner merge, and by default pandas joins on every column that has the same name in both DataFrames. Specify them explicitly, e.g.:
df_merged_file = pd.merge(df_my_data, df_transportation_data, how='left', on='census_tract')
It is possible that one of the columns you previously removed from the "transportation_data.xlsx" file has the same name as a column in your "my_data.xlsx"; an inner merge then also requires a match on that column, and unmatched rows are dropped.
A 'left' merge keeps every row of "my_data.xlsx" and attaches the columns from "transportation_data.xlsx" where there is a match (NaN where there is not). This means your merged DataFrame will have the same number of rows as your "my_data.xlsx" has currently.
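To see the difference, here is a minimal sketch with made-up tract codes (not your actual data):
import pandas as pd

left = pd.DataFrame({'census_tract': [1, 2, 3], 'id': [10, 20, 30]})
right = pd.DataFrame({'census_tract': [1, 2, 4], 'vehicles_per_household': [2.1, 1.5, 1.8]})

# inner merge keeps only tracts present in both frames -> 2 rows
print(pd.merge(left, right, on='census_tract'))

# left merge keeps every row of `left`; unmatched tracts get NaN -> 3 rows
print(pd.merge(left, right, on='census_tract', how='left'))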
Well, I think there was something wrong with the initial download of the transportation file. I downloaded it again and this time I was able to get a complete merge. Sorry for being an idiot. Thank you all for your help.
I have data like this:
timestamp high windSpeed windDir windU windV
04/05/2019 10:02 100 4.39 179.1 -0.14 8.53
150 2.44 164.5 -1.26 4.57
200 4.29 180.9 0.12 8.32
04/05/2019 10:03 100 4.39 179.1 -0.15 8.53
150 2.44 164.5 -1.26 4.57
200 4.29 180.9 0.12 8.32
04/05/2019 10:04 100 4.52 179.1 -0.16 8.79
150 2.15 162.8 -1.24 4
200 3.34 181.9 0.21 6.49
04/05/2019 10:05 100 4.52 179.1 -0.17 8.79
150 2.15 162.8 -1.24 4
200 3.34 181.9 0.21 6.49
and I want to subtract the lower level's value from the higher level's value at each time. This is what I have so far, but it only gives me one value. Can anyone help me, please? Thank you.
for timestamp, group in grouped:
    HeightIndices = group["high"].keys()
    for heightIndex in range(HeightIndices[0], HeightIndices[0] + len(HeightIndices) - 1):
        windMag = sqrt(group["windU"] ** 2 + group["windV"] ** 2)
        diffMag = windMag[heightIndex+1] - windMag[heightIndex]
I'm not sure if I'm understanding exactly what you're asking, but from looking at your code, it seems you are trying to get the difference between the i-th and (i+1)-th values in the column "high" and call that variable diffMag. If that's the case, you can probably use one of the two methods below.
Solution 1:
diff_mag = []
for i in range(len(wind['height']) - 1):
    diff_mag.append(wind['height'][i+1] - wind['height'][i])
Solution 2:
Use numpy's diff:
import numpy as np

np.diff(wind['height'])
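For example, np.diff returns the differences between consecutive elements:
heights = np.array([100, 150, 200])
np.diff(heights)  # array([50, 50])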
I made the assumption you're using pandas here based on what your code block looks like. Hope that helps.
EDIT
Okay... I think I understand what you are saying now.
I think this should work:
import numpy as np

windMag = []
for timestamp, group in grouped:
    # wind-vector magnitude at each height level for this timestamp
    windMag.append(np.sqrt(group["windU"] ** 2 + group["windV"] ** 2))
# differences between consecutive height levels, computed per timestamp
diffMag = [np.diff(mag) for mag in windMag]
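For reference, here is a vectorized sketch that avoids the explicit loops entirely (it assumes the blank timestamp cells are forward-filled so each row carries its timestamp, and that rows are ordered by height within each timestamp):
import numpy as np

df['timestamp'] = df['timestamp'].ffill()  # fill the blank repeated-timestamp cells
df['windMag'] = np.sqrt(df['windU'] ** 2 + df['windV'] ** 2)
# diff between consecutive height levels within each timestamp
df['diffMag'] = df.groupby('timestamp')['windMag'].diff()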
I have a pandas DataFrame with the following columns:
Stock ROC5 ROC20 ROC63 ROCmean
0 IBGL.SW -0.59 3.55 6.57 3.18
0 EHYA.SW 0.98 4.00 6.98 3.99
0 HIGH.SW 0.94 4.22 7.18 4.11
0 IHYG.SW 0.56 2.46 6.16 3.06
0 HYGU.SW 1.12 4.56 7.82 4.50
0 IBCI.SW 0.64 3.57 6.04 3.42
0 IAEX.SW 8.34 18.49 14.95 13.93
0 AGED.SW 9.45 24.74 28.13 20.77
0 ISAG.SW 7.97 21.61 34.34 21.31
0 IAPD.SW 0.51 6.62 19.54 8.89
0 IASP.SW 1.08 2.54 12.18 5.27
0 RBOT.SW 10.35 30.53 39.15 26.68
0 RBOD.SW 11.33 30.50 39.69 27.17
0 BRIC.SW 7.24 11.08 75.60 31.31
0 CNYB.SW 1.14 4.78 8.36 4.76
0 FXC.SW 5.68 13.84 19.29 12.94
0 DJSXE.SW 3.11 9.24 6.44 6.26
0 CSSX5E.SW -0.53 5.29 11.85 5.54
How can I add a new column "Symbol" to the DataFrame containing the stock ticker without the ".SW" suffix?
For example, the first row's result should be IBGL (from IBGL.SW).
The last row's result should be CSSX5E (split from CSSX5E.SW).
If I run the following command:
new_df['Symbol'] = new_df.loc[:, ('Stock')].str.split('.').str[0]
Then I receive a warning message:
:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
new_df['Symbol'] = new_df.loc[:, ('Stock')].str.split('.').str[0]
How can I solve this problem?
Thanks a lot for your support.
METHOD 1:
You can do a vectorized operation with str.get(0):
df['SYMBOL'] = df['Stock'].str.split('.').str.get(0)
METHOD 2:
You can do another vectorized operation by using expand=True in str.split() and then taking the first column:
df['SYMBOL'] = df['Stock'].str.split('.', expand=True)[0]
METHOD 3:
Or you can write a custom lambda function with apply (useful for more complex processing). Note that this is slower, but it is handy if you need your own UDF:
df['SYMBOL'] = df['Stock'].apply(lambda x: x.split('.')[0])
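As a quick sanity check on a toy frame (made-up values, not your data):
import pandas as pd

df = pd.DataFrame({'Stock': ['IBGL.SW', 'CSSX5E.SW']})
df['SYMBOL'] = df['Stock'].str.split('.').str.get(0)
print(df['SYMBOL'].tolist())  # ['IBGL', 'CSSX5E']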
This is not an error but a warning; as you may have noticed, your script still finishes executing.
Edit: Given your comments, it seems the issue originates earlier in your code, so I suggest you use the following:
new_df = new_df.copy(deep=False)
And then proceed to solve it with:
new_df.loc[:, 'Symbol'] = new_df['Stock'].str.split('.').str[0]
new_df = new_df.copy()
new_df['Symbol'] = new_df.Stock.str.replace('.SW', '', regex=False)  # regex=False so '.' is treated literally
Total time: 1.01876 s
Function: prepare at line 91

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    91                                           @profile
    92                                           def prepare():
   ...
    98         1        536.0    536.0      0.1      tss = df.groupby('user_id').timestamp
    99         1     949643.0 949643.0     93.2      delta = tss.diff()
   ...
(remaining lines omitted; each accounts for less than 2% of the total time)
I have a DataFrame which I group by some key, then select a column from each group and perform diff on that column (per group). As shown in the profiling results, the diff operation is super slow compared to the rest and is the bottleneck. Is this expected? Are there faster alternatives that achieve the same result?
Edit: some more explanation
In my use case the timestamps represent the times of a user's actions, and I want to calculate the deltas between consecutive actions (they are already sorted), but each user's actions are completely independent of the other users'.
Edit: Sample code
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ts': [1,2,3,4,60,61,62,63,64,150,155,156,
           1,2,3,4,60,61,62,63,64,150,155,163,
           1,2,3,4,60,61,62,63,64,150,155,183],
    'id': [1,2,3,4,60,61,62,63,64,150,155,156,
           71,72,73,74,80,81,82,83,64,160,165,166,
           21,22,23,24,90,91,92,93,94,180,185,186],
    'other': ['x','x','x','','x','x','','x','x','','x','',
              'y','y','y','','y','y','','y','y','','y','',
              'z','z','z','','z','z','','z','z','','z',''],
    'user': ['x','x','x','x','x','x','x','x','z','x','x','y',
             'y','y','y','y','y','y','y','y','x','y','y','x',
             'z','z','z','z','z','z','z','z','y','z','z','z']
})
df.set_index('id', inplace=True)
deltas = df.groupby('user').ts.transform(pd.Series.diff)
If you do not wish to sort your data or drop down to numpy, then a significant performance improvement may be possible by changing your user series to Categorical. Categorical data is stored effectively as integer pointers.
In the example below, I see an improvement from 86 ms to 59 ms in total (35.7 ms for the groupby plus a one-off 23.4 ms for the conversion). This may improve further for larger datasets and where more users are repeated.
df = pd.concat([df]*10000)
%timeit df.groupby('user').ts.transform(pd.Series.diff) # 86.1 ms per loop
%timeit df['user'].astype('category') # 23.4 ms per loop
df['user'] = df['user'].astype('category')
%timeit df.groupby('user').ts.transform(pd.Series.diff) # 35.7 ms per loop
If you are performing multiple operations, then the one-off cost of converting to categorical can be discounted.
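For comparison, here is a sketch of the "drop down to numpy" route mentioned above (my own addition, assuming a stable sort by user is acceptable; it reproduces the per-group diff without groupby):
import numpy as np

# stable sort by user so each user's rows become contiguous,
# preserving the original within-user order
order = np.argsort(df['user'].to_numpy(), kind='stable')
ts = df['ts'].to_numpy()[order]
users = df['user'].to_numpy()[order]

deltas = np.empty(len(ts))
deltas[0] = np.nan
deltas[1:] = np.diff(ts)
deltas[1:][users[1:] != users[:-1]] = np.nan  # NaN at each user boundary

# scatter back to the original row order
out = np.empty_like(deltas)
out[order] = deltas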