I am a beginner working with a clinical data set using Pandas in Jupyter Notebook.
A column of my data contains census tract codes and I am trying to merge my data with a large transportation data file that also has a column with census tract codes.
I initially only wanted 2 of the other columns from that transportation file so, after I downloaded the file, I removed all of the other columns except the 2 that I wanted to add to my file and the census tract column.
This is the code I used:
df_my_data = pd.read_excel("my_data.xlsx")
df_transportation_data = pd.read_excel("transportation_data.xlsx")
df_merged_file = pd.merge(df_my_data, df_transportation_data)
df_merged_file.to_excel('my_merged_file.xlsx', index = False)
This worked, but then I wanted to add the other columns from the transportation file, so I went back to my initial file (prior to adding the 2 transportation columns) and tried to merge the entire transportation file. This resulted in a new DataFrame with all of the desired columns but only 4 rows.
I thought maybe the transportation file was too big, so I tried merging individual columns (other than the 2 I was initially able to merge), and this again resulted in all of the correct columns but only 4 merged rows.
Any help would be much appreciated.
Edits:
Sorry for not being more clear.
Here is the code for the 2 initial columns I merged:
import pandas as pd
df_my_data = pd.read_excel('my_data.xlsx')
df_two_columns = pd.read_excel('two_columns_from_transportation_file.xlsx')
df_two_columns_merged = pd.merge(df_my_data, df_two_columns, on=['census_tract'])
df_two_columns_merged.to_excel('two_columns_merged.xlsx', index = False)
The outputs were:
df_my_data.head()
census_tract id e t
0 6037408401 1 1 1092
1 6037700200 2 1 1517
2 6065042740 3 1 2796
3 6037231210 4 1 1
4 6059076201 5 1 41
df_two_columns.head()
census_tract households_with_no_vehicle vehicles_per_household
0 6001400100 2.16 2.08
1 6001400200 6.90 1.50
2 6001400300 17.33 1.38
3 6001400400 8.97 1.41
4 6001400500 11.59 1.39
df_two_columns_merged.head()
census_tract id e t households_with_no_vehicle vehicles_per_household
0 6037408401 1 1 1092 4.52 2.43
1 6037700200 2 1 1517 9.88 1.26
2 6065042740 3 1 2796 2.71 1.49
3 6037231210 4 1 1 25.75 1.35
4 6059076201 5 1 41 1.63 2.22
df_my_data has 657 rows and df_two_columns_merged came out with 657 rows.
The code for when I tried to merge the entire transport file:
import pandas as pd
df_my_data = pd.read_excel('my_data.xlsx')
df_transportation_data = pd.read_excel('transportation_data.xlsx')
df_merged_file = pd.merge(df_my_data, df_transportation_data, on=['census_tract'])
df_merged_file.to_excel('my_merged_file.xlsx', index = False)
The output:
df_transportation_data.head()
census_tract Bike Carpooled Drove Alone Households No Vehicle Public Transportation Walk Vehicles per Household
0 6001400100 0.00 12.60 65.95 2.16 20.69 0.76 2.08
1 6001400200 5.68 3.66 45.79 6.90 39.01 5.22 1.50
2 6001400300 7.55 6.61 46.77 17.33 31.19 6.39 1.38
3 6001400400 8.85 11.29 43.91 8.97 27.67 4.33 1.41
4 6001400500 8.45 7.45 46.94 11.59 29.56 4.49 1.39
df_merged_file.head()
census_tract id e t Bike Carpooled Drove Alone Households No Vehicle Public Transportation Walk Vehicles per Household
0 6041119100 18 0 2755 1.71 3.02 82.12 4.78 8.96 3.32 2.10
1 6061023100 74 1 1201 0.00 9.85 86.01 0.50 2.43 1.16 2.22
2 6041110100 80 1 9 0.30 4.40 72.89 6.47 13.15 7.89 1.82
3 6029004902 123 0 1873 0.00 18.38 78.69 4.12 0.00 0.00 2.40
The df_merged_file only has 4 total rows.
So my question is: why is it that I am able to merge those initial 2 columns from the transportation file and keep all of the rows from my file but when I try to merge the entire transportation file I only get 4 rows of output?
I recommend specifying the merge type and the merge column(s).
When you use pd.merge() with no keyword arguments, the default is an inner merge on every column that has the same name in both DataFrames. To be explicit, use:
df_merged_file = pd.merge(df_my_data, df_transportation_data, how='left', left_on=[COLUMN], right_on=[COLUMN])
It is possible that one of the columns you removed from "transportation_data.xlsx" previously has the same name as a column in your "my_data.xlsx", causing the inner merge to match on that column too and silently drop the unmatched rows.
A 'left' merge keeps every row of "my_data.xlsx" and attaches the transportation columns wherever there is a match (filling NaN where there is none), so the merged DataFrame will have the same number of rows as your "my_data.xlsx" has currently.
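If you want to see which rows fail to match, a quick diagnostic sketch (assuming the key column is named census_tract in both files, as in your code) is:
import pandas as pd
df_my_data = pd.read_excel('my_data.xlsx')
df_transportation_data = pd.read_excel('transportation_data.xlsx')
# The keys must have the same dtype on both sides; a tract stored as a
# number in one file and as text in the other will never match.
print(df_my_data['census_tract'].dtype, df_transportation_data['census_tract'].dtype)
# indicator=True adds a _merge column showing whether each row matched
# ('both') or came only from your file ('left_only')
check = pd.merge(df_my_data, df_transportation_data,
                 how='left', on='census_tract', indicator=True)
print(check['_merge'].value_counts())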
Well, I think there was something wrong with the initial download of the transportation file. I downloaded it again and this time I was able to get a complete merge. Sorry for being an idiot. Thank you all for your help.
This is my data frame (labeled unp):
LOCATION TIME Unemployment_Rate Unit_Labour_Cost GDP_CAP PTEmployment HR_WKD Collective IndividualCollective Individual Temp GDPCAP_ULC GDP_Growth
0 AUT 2013 5.336031 2.632506 47936.67796 19.863556 1632.1 2.14 1.80 1.66 1.47 18209.522774 NaN
1 AUT 2014 5.621219 1.996807 48813.53441 20.939237 1621.6 2.14 1.80 1.66 1.47 24445.794917 876.85645
2 AUT 2015 5.723468 1.515733 49925.22780 21.026548 1598.9 2.14 1.80 1.66 1.47 32938.009399 1111.69339
3 AUT 2016 6.014071 1.610391 50923.69330 20.889132 1609.4 2.14 1.80 1.66 1.47 31621.943553 998.46550
4 BEL 2013 8.425185 1.988013 43745.95156 18.212509 1558.0 2.48 2.22 2.11 1.91 22004.861920 -7177.74174
... ... ... ... ... ... ... ... ... ... ... ... ... ...
101 SWE 2016 6.991096 1.899792 48690.14644 13.800736 1626.0 2.72 2.54 2.48 1.55 25629.198586 779.74573
102 USA 2013 7.375000 1.099109 53016.28880 12.255613 1782.0 1.33 1.31 1.30 0.27 48235.697096 4326.14236
103 USA 2014 6.166667 2.027852 54935.20048 10.611552 1784.0 1.33 1.31 1.30 0.27 27090.340163 1918.91168
104 USA 2015 5.291667 1.912012 56700.88042 9.879047 1785.0 1.33 1.31 1.30 0.27 29655.086066 1765.67994
105 USA 2016 4.866667 1.045644 57797.46221 9.454144 1781.0 1.33 1.31 1.30 0.27 55274.512367 1096.58179
I want to fill the column GDP_Growth, which is currently blank, with the value of:
unp.GDP_CAP - unp.GDP_CAP.shift(1)
for rows where 'TIME' is 2014 or later; otherwise it should be NaN.
Tried using the if function directly but it's not working:
if unp.loc[unp['TIME'] > 2014]:
    unp['GDP_Growth'] = unp.GDP_CAP - unp.GDP_CAP.shift(1)
else:
    return
You should avoid if statements when working with dataframes, since row-by-row Python logic is slower (less efficient).
Instead, depending on what you need, you can use np.where().
Because the dataframe in the question was posted as a picture (as opposed to text), I'll give you the standard implementation, which looks like this:
import pandas as pd
import numpy as np
# Create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [5, 6, 7, 8, 9]})
# Use np.where() to select values from column 'A' where column 'B' is greater than 7
result = np.where(df['B'] > 7, df['A'], 0)
# Print the result
print(result)
The result of the above is this:
[0 0 0 4 5]
You will need to modify the above for your particular dataframe.
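Applied to the unp dataframe from the question, a sketch (assuming the column names shown above, and that growth should only be kept from 2014 onward) might look like:
import numpy as np
# Growth is the change in GDP_CAP from the previous row, kept only where
# TIME is 2014 or later; earlier rows get NaN.
# (If growth should not cross country boundaries, replace the shift with
# unp.groupby('LOCATION')['GDP_CAP'].diff().)
unp['GDP_Growth'] = np.where(unp['TIME'] >= 2014,
                             unp['GDP_CAP'] - unp['GDP_CAP'].shift(1),
                             np.nan)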
The question in the title is currently "Python: How do I use the if function when calling out a specific row?", which my answer will not address directly. Instead, we will compute the derivative / 'growth' and selectively apply it.
Explanation: In Python, you generally want to use a vectorized style, keeping most computation outside of the Python interpreter and instead working with C-implemented functions.
Solution:
A. Obtain the derivative/'growth'
For your dataframe df = pd.DataFrame(...) you can obtain the change in value for a specific column with df['column_name'].diff(), e.g.
# This is your dataframe
In : df
Out:
gdp growth year
0 0 <NA> 2000
1 1 <NA> 2001
2 2 <NA> 2002
3 3 <NA> 2003
4 4 <NA> 2004
In : df['gdp'].diff()
Out:
0 NaN
1 1.0
2 1.0
3 1.0
4 1.0
Name: gdp, dtype: float64
B. Apply it to the 'growth' column
In : df['growth'] = df['gdp'].diff()
df
Out:
gdp growth year
0 0 NaN 2000
1 1 1.0 2001
2 2 1.0 2002
3 3 1.0 2003
4 4 1.0 2004
C. Selectively exclude values
If you then want specific years to have a certain value, apply them selectively
In : df.loc[df['year'] < 2003, 'growth'] = np.nan
df
Out:
gdp growth year
0 0 NaN 2000
1 1 NaN 2001
2 2 NaN 2002
3 3 1.0 2003
4 4 1.0 2004
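For reference, steps B and C can also be combined into a single line with Series.where, which keeps the computed values where the condition holds and fills NaN elsewhere:
df['growth'] = df['gdp'].diff().where(df['year'] >= 2003)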
I have a pandas dataframe (below) with long gaps in time, and I want to slice it into smaller dataframes so that each time "cluster" stays together.
Time Value
0 56610.41341 8.55
1 56587.56394 5.27
2 56590.62965 6.81
3 56598.63790 5.47
4 56606.52203 6.71
5 56980.44206 4.75
6 56592.53327 6.53
7 57335.52837 0.74
8 56942.59094 6.96
9 56921.63669 9.16
10 56599.52053 6.14
11 56605.50235 5.20
12 57343.63828 3.12
13 57337.51641 3.17
14 56593.60374 5.69
15 56882.61571 9.50
I tried sorting this and taking the time difference of two consecutive points with
df = df.sort_values("Time")
df['t_dif'] = df['Time'] - df['Time'].shift(-1)
And it gives
Time Value t_dif
1 56587.56394 5.27 -3.06571
2 56590.62965 6.81 -1.90362
6 56592.53327 6.53 -1.07047
14 56593.60374 5.69 -5.03416
3 56598.63790 5.47 -0.88263
10 56599.52053 6.14 -5.98182
11 56605.50235 5.20 -1.01968
4 56606.52203 6.71 -3.89138
0 56610.41341 8.55 -272.20230
15 56882.61571 9.50 -39.02098
9 56921.63669 9.16 -20.95425
8 56942.59094 6.96 -37.85112
5 56980.44206 4.75 -355.08631
7 57335.52837 0.74 -1.98804
13 57337.51641 3.17 -6.12187
12 57343.63828 3.12 NaN
Let's say I want to slice this dataframe into smaller dataframes in which the time difference between consecutive points stays below 40. How would I go about doing this?
I could loop over the rows, but that is frowned upon, so is there a smarter solution?
Edit: Here is an example:
df1:
Time Value t_dif
1 56587.56394 5.27 -3.06571
2 56590.62965 6.81 -1.90362
6 56592.53327 6.53 -1.07047
14 56593.60374 5.69 -5.03416
3 56598.63790 5.47 -0.88263
10 56599.52053 6.14 -5.98182
11 56605.50235 5.20 -1.01968
4 56606.52203 6.71 -3.89138
df2:
0 56610.41341 8.55 -272.20230
df3:
15 56882.61571 9.50 -39.02098
9 56921.63669 9.16 -20.95425
8 56942.59094 6.96 -37.85112
...
etc.
I think you can just:
df1 = df[df['t_dif'].abs() < 40]
df2 = df[df['t_dif'].abs() >= 40]
def split_dataframe(df, value):
    df = df.sort_values("Time")
    df = df.reset_index()
    df['t_dif'] = (df['Time'] - df['Time'].shift(-1)).abs()
    # Rows whose gap to the next point exceeds the threshold end a cluster
    indxs = df.index[df['t_dif'] > value].tolist()
    indxs.append(-1)        # sentinel before the first row
    indxs.append(len(df))   # sentinel past the last row
    indxs.sort()
    frames = []
    for i in range(1, len(indxs)):
        # Each cluster runs from just past the previous boundary row
        # up to and including the current boundary row
        val = df.iloc[indxs[i - 1] + 1: indxs[i] + 1]
        frames.append(val)
    return frames
This returns the clusters as a list of dataframes, e.g. frames = split_dataframe(df, 40).
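For what it's worth, the same split can be done without building a boundary list by turning the gaps into group labels with a cumulative sum; a sketch, assuming the threshold of 40 from the question:
import pandas as pd
df = df.sort_values('Time')
# diff() measures the gap to the previous row, so a gap > 40 marks the
# first row of a new cluster; cumsum() turns the markers into
# consecutive cluster labels 0, 1, 2, ...
labels = (df['Time'].diff() > 40).cumsum()
frames = [group for _, group in df.groupby(labels)]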
Need help converting a txt file to csv with the rows and columns intact. The text file is here:
(http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2020&MONTH=06&FROM=2300&TO=2300&STNM=72265)
So far I only have this...
df = pd.read_csv('sounding-72265-2020010100.txt',delimiter=',')
df.to_csv('sounding-72265-2020010100.csv')
But it has only one column, with all of the other columns collapsed into its rows.
Instead, I want to format it to something like the CSV layout shown on the site, with each variable in its own column.
Thanks for any help
I'm assuming you can start with text copied from the website; i.e. you create a data.txt file looking like the following by copy/pasting:
1000.0 8
925.0 718
909.0 872 39.6 4.6 12 5.88 80 7 321.4 340.8 322.5
900.0 964 37.6 11.6 21 9.62 75 8 320.2 351.3 322.1
883.0 1139 36.6 7.6 17 7.47 65 9 321.0 345.3 322.4
...
...
...
Then the following works, mainly based on this answer:
import pandas as pd
df = pd.read_table('data.txt', header=None, sep='\n')
df = df[0].str.strip().str.split(r'\s+', expand=True)
You read the data only separating by new lines, generating a one column df. Then use string methods to format the entries and expand them into a new DataFrame.
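Note that after the split every entry is still a string; if you need numeric values, one pass of pd.to_numeric converts them (errors='coerce' turns any non-numeric stragglers into NaN):
df = df.apply(pd.to_numeric, errors='coerce')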
You can then add the column names in as such with help from this answer:
col1 = 'PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV'.split()
col2 = 'hPa m C C % g/kg deg knot K K K '.split()
df.columns = pd.MultiIndex.from_tuples(zip(col1,col2), names = ['Variable','Unit'])
The result (df.head()):
Variable PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
Unit hPa m C C % g/kg deg knot K K K
0 1000.0 8 None None None None None None None None None
1 925.0 718 None None None None None None None None None
2 909.0 872 39.6 4.6 12 5.88 80 7 321.4 340.8 322.5
3 900.0 964 37.6 11.6 21 9.62 75 8 320.2 351.3 322.1
4 883.0 1139 36.6 7.6 17 7.47 65 9 321.0 345.3 322.4
I would actually probably drop the 'Unit' level of the column names were it me, because I think multiindex columns can make things more complicated to slice.
Again, both reading the data and building the column names assume you can just copy/paste them into a text file / into Python and then parse. If you are reading many pages like this, or are looking to do some sort of web scraping, that will require additional work.
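Alternatively, since the sounding text is a fixed-width table, pandas' fixed-width reader may work directly on the saved file. A sketch (the skiprows count is an assumption; it depends on how many header lines your copy of the file has):
import pandas as pd
df = pd.read_fwf('sounding-72265-2020010100.txt', skiprows=5, header=None)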
How can I get my JSON data into a reasonable data frame? I have a deeply nested file which I aim to get into a large data frame. All is described in the Github repository below:
http://www.github.com/simongraham/dataExplore.git
With nested JSONs you will need to walk through the levels, extracting the segments you need. For the nutrition segment of the larger JSON, consider iterating through every nutritionPortions level, each time running the pandas normalization and concatenating to the final dataframe:
import pandas as pd
import json
with open('/Users/simongraham/Desktop/Kaido/Data/kaidoData.json') as f:
    data = json.load(f)
# INITIALIZE DF
nutrition = pd.DataFrame()
# ITERATIVELY CONCATENATE
for item in data[0]["nutritionPortions"]:
    if 'ftEnergyKcal' in item.keys():  # MISSING IN 3 OF 53 LEVELS
        temp = (pd.io
                  .json
                  .json_normalize(item, 'nutritionNutrients',
                                  ['vcNutritionId', 'vcUserId', 'vcPortionId',
                                   'vcPortionName', 'vcPortionSize',
                                   'ftEnergyKcal', 'vcPortionUnit', 'dtConsumedDate']))
        nutrition = pd.concat([nutrition, temp])
nutrition.head()
Output
ftValue nPercentRI vcNutrient vcNutritionPortionId \
0 0.00 0.0 alcohol c993ac30-ecb4-4154-a2ea-d51dbb293f66
1 0.00 0.0 bcfa c993ac30-ecb4-4154-a2ea-d51dbb293f66
2 7.80 6.0 biotin c993ac30-ecb4-4154-a2ea-d51dbb293f66
3 49.40 2.0 calcium c993ac30-ecb4-4154-a2ea-d51dbb293f66
4 1.82 0.0 carbohydrate c993ac30-ecb4-4154-a2ea-d51dbb293f66
vcTrafficLight vcUnit dtConsumedDate \
0 g 2016-04-12T00:00:00
1 g 2016-04-12T00:00:00
2 µg 2016-04-12T00:00:00
3 mg 2016-04-12T00:00:00
4 g 2016-04-12T00:00:00
vcNutritionId ftEnergyKcal \
0 070b97a4-d562-427d-94a8-1de1481df5d1 18.2
1 070b97a4-d562-427d-94a8-1de1481df5d1 18.2
2 070b97a4-d562-427d-94a8-1de1481df5d1 18.2
3 070b97a4-d562-427d-94a8-1de1481df5d1 18.2
4 070b97a4-d562-427d-94a8-1de1481df5d1 18.2
vcUserId vcPortionName vcPortionSize \
0 fe585e3d-2863-46fe-a41f-290bf58ad169 1 mug 260
1 fe585e3d-2863-46fe-a41f-290bf58ad169 1 mug 260
2 fe585e3d-2863-46fe-a41f-290bf58ad169 1 mug 260
3 fe585e3d-2863-46fe-a41f-290bf58ad169 1 mug 260
4 fe585e3d-2863-46fe-a41f-290bf58ad169 1 mug 260
vcPortionId vcPortionUnit
0 2 ml
1 2 ml
2 2 ml
3 2 ml
4 2 ml
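As a side note, concatenating inside the loop copies the growing dataframe on every iteration; collecting the pieces in a list and concatenating once is usually faster. A sketch of the same loop restructured (in newer pandas versions json_normalize is available at the top level as pd.json_normalize):
frames = []
for item in data[0]["nutritionPortions"]:
    if 'ftEnergyKcal' in item:  # missing in 3 of 53 levels
        frames.append(pd.json_normalize(item, 'nutritionNutrients',
                      ['vcNutritionId', 'vcUserId', 'vcPortionId', 'vcPortionName',
                       'vcPortionSize', 'ftEnergyKcal', 'vcPortionUnit', 'dtConsumedDate']))
nutrition = pd.concat(frames, ignore_index=True)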
I am brand new to pandas and working with two dataframes. My goal is to append the non-date values of df_ls (below) column-wise to their nearest respective date in df_1. Is the only way to do this a traditional for-loop, or is there some more effective built-in method/function? I have googled this extensively without any luck and have only found ways to append blocks of dataframes to other dataframes; I haven't found a way to search through one dataframe and append a row from another at the nearest respective date. See the example below:
Example of the first dataframe (let's call it df_ls):
DATE ALBEDO_SUR B13_RATIO B23_RATIO B1_RAW B2_RAW
0 1999-07-04 0.070771 1.606958 1.292280 0.128069 0.103018
1 1999-07-20 0.030795 2.326290 1.728147 0.099020 0.073595
2 1999-08-21 0.022819 2.492871 1.762536 0.096888 0.068502
3 1999-09-06 0.014613 2.792271 1.894225 0.090590 0.061445
4 1999-10-08 0.004978 2.781847 1.790768 0.089291 0.057521
5 1999-10-24 0.003144 2.818474 1.805257 0.090623 0.058054
6 1999-11-09 0.000859 3.146100 1.993941 0.092787 0.058823
7 1999-12-11 0.000912 2.913604 1.656642 0.097239 0.055357
8 1999-12-27 0.000877 2.974692 1.799949 0.098282 0.059427
9 2000-01-28 0.000758 3.092533 1.782112 0.095153 0.054809
10 2000-03-16 0.002933 2.969185 1.727465 0.083059 0.048322
11 2000-04-01 0.016814 2.366437 1.514110 0.089720 0.057398
12 2000-05-03 0.047370 1.847763 1.401930 0.109767 0.083290
13 2000-05-19 0.089432 1.402798 1.178798 0.137965 0.115936
14 2000-06-04 0.056340 1.807828 1.422489 0.118601 0.093328
Example of the second dataframe (let's call it df_1):
Sample Date Value
0 2000-05-09 1.68
1 2000-05-09 1.68
2 2000-05-18 1.75
3 2000-05-18 1.75
4 2000-05-31 1.40
5 2000-05-31 1.40
6 2000-06-13 1.07
7 2000-06-13 1.07
8 2000-06-27 1.49
9 2000-06-27 1.49
10 2000-07-11 2.29
11 2000-07-11 2.29
In the end, my goal is to have something like this (note the appended values are the values closest to each Sample Date, even though they don't match up perfectly):
Sample Date Value ALBEDO_SUR B13_RATIO B23_RATIO B1_RAW B2_RAW
0 2000-05-09 1.68 0.047370 1.847763 1.401930 0.109767 0.083290
1 2000-05-09 1.68 0.047370 1.847763 1.401930 0.109767 0.083290
2 2000-05-18 1.75 0.089432 1.402798 1.178798 0.137965 0.115936
3 2000-05-18 1.75 0.089432 1.402798 1.178798 0.137965 0.115936
4 2000-05-31 1.40 0.056340 1.807828 1.422489 0.118601 0.093328
5 2000-05-31 1.40 0.056340 1.807828 1.422489 0.118601 0.093328
6 2000-06-13 1.07 ETC.... ETC.... ETC ...
7 2000-06-13 1.07
8 2000-06-27 1.49
9 2000-06-27 1.49
10 2000-07-11 2.29
11 2000-07-11 2.29
Thanks for any and all help. As I said, I am new to this; I have experience with this sort of thing in MATLAB, but pandas is new to me.
Thanks
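A vectorized way to do this kind of nearest-date join is pd.merge_asof with direction='nearest'. A sketch (it assumes both date columns are parsed as datetimes and both frames are sorted by their date column, which merge_asof requires):
import pandas as pd
df_ls['DATE'] = pd.to_datetime(df_ls['DATE'])
df_1['Sample Date'] = pd.to_datetime(df_1['Sample Date'])
df_ls = df_ls.sort_values('DATE')
df_1 = df_1.sort_values('Sample Date')
# For each Sample Date, attach the df_ls row whose DATE is closest
merged = pd.merge_asof(df_1, df_ls,
                       left_on='Sample Date', right_on='DATE',
                       direction='nearest')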