I am trying to load a number of csv files from a folder, use a function that calculates missing values on each file, then saves new csv files containing the output. When I edit the script to print the output, I get the expected result. However, the loop only ever saves the last file to the directory. The code I am using is:
from pathlib import Path
import pandas as pd
import os
from glob import glob
files = glob("C:/Users/61437/Desktop/test_folder/*.csv") # get all csv's from folder
n = 0
for file in files:
    print(file)
    df = pd.read_csv(file, index_col = False)
    d = calc_missing_prices(df) # calc_missing_prices is a user defined function
    print(d)
    d.to_csv(r'C:\Users\61437\Desktop\test_folder\derived_files\derived_{}.csv'.format(n+1), index = False)
The print() command returns the expected output, which for my data is:
C:/Users/61437/Desktop/test_folder\file1.csv
V_150 V_200 V_300 V_375 V_500 V_750 V_1000
0 3.00 2.75 4.50 6.03 8.35 12.07 15.00
1 2.32 3.09 4.63 5.00 9.75 12.50 12.25
2 1.85 2.47 3.70 4.62 6.17 9.25 12.33
3 1.75 2.00 4.06 6.50 6.78 10.16 15.20
C:/Users/61437/Desktop/test_folder\file2.csv
V_300 V_375 V_500 V_750 V_1000
0 4.00 4.50 6.06 9.08 11.00
1 3.77 5.00 6.50 8.50 12.56
2 3.00 3.66 4.88 7.31 9.50
C:/Users/61437/Desktop/test_folder\file3.csv
V_500 V_750 V_1000
0 5.50 8.25 11.00
1 6.50 8.50 12.17
2 4.75 7.12 9.50
However the only saved csv file is 'derived_1.csv' which contains the output from file3.csv
What am I doing that is preventing all three files from being created?
You are not incrementing n inside the loop, so n+1 is always 1. Every iteration writes to derived_1.csv and overwrites the previous output, so once the for loop finishes only the result for the last csv remains.
Include the line n += 1 inside the for loop to increment it by 1 on every iteration.
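A minimal sketch of the same fix using enumerate, so there is no separate counter to forget. The function name save_derived, its arguments, and the process callable (standing in for calc_missing_prices) are hypothetical names chosen for illustration:

```python
from pathlib import Path

import pandas as pd


def save_derived(src_dir, out_dir, process):
    """Apply `process` to every CSV in src_dir and write derived_<n>.csv to out_dir.

    `process` stands in for calc_missing_prices; all names here are illustrative.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # sorted() makes the numbering deterministic; enumerate() advances n on every pass
    for n, file in enumerate(sorted(Path(src_dir).glob("*.csv")), start=1):
        df = pd.read_csv(file, index_col=False)
        target = out_dir / f"derived_{n}.csv"
        process(df).to_csv(target, index=False)
        written.append(target)
    return written
```

Each input file now gets its own output name, so nothing is overwritten.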
So I want to show this data in just two columns. For example, I want to turn this data
Year Jan Feb Mar Apr May Jun
1997 3.45 2.15 1.89 2.03 2.25 2.20
1998 2.09 2.23 2.24 2.43 2.14 2.17
1999 1.85 1.77 1.79 2.15 2.26 2.30
2000 2.42 2.66 2.79 3.04 3.59 4.29
into this
Date Price
Jan-1997 3.45
Feb-1997 2.15
Mar-1997 1.89
Apr-1997 2.03
....
Jan-2000 2.42
Feb-2000 2.66
So far, I have read about how to combine two columns into another dataframe using .apply() and .agg(), but I found nothing on how to combine them as shown above.
import pandas as pd
df = pd.read_csv('matrix-A.csv', index_col =0 )
matrix_b = ({})
new = pd.DataFrame(matrix_b)
new["Date"] = df['Year'].astype(float) + "-" + df["Dec"]
print(new)
I have tried it this way, but of course it does not work. I have also tried using pd.Series(), with no success.
Is there any site where I can learn how to do this, or does anybody know the correct way to solve it?
Another possible solution, which is based on pandas.DataFrame.stack:
out = df.set_index('Year').stack()
out.index = ['{}_{}'.format(j, i) for i, j in out.index]
out = out.reset_index()
out.columns = ['Date', 'Value']
Output:
Date Value
0 Jan_1997 3.45
1 Feb_1997 2.15
2 Mar_1997 1.89
3 Apr_1997 2.03
4 May_1997 2.25
....
19 Feb_2000 2.66
20 Mar_2000 2.79
21 Apr_2000 3.04
22 May_2000 3.59
23 Jun_2000 4.29
You can first convert it to long-form using melt. Then, create a new column for Date by combining two columns.
long_df = pd.melt(df, id_vars=['Year'], var_name='Month', value_name="Price")
long_df['Date'] = long_df['Month'] + "-" + long_df['Year'].astype('str')
long_df[['Date', 'Price']]
If you want to sort your date column, here is a good resource. Follow those instructions after melting and before creating the Date column.
You can use pandas.DataFrame.melt :
out = (
    df
    .melt(id_vars="Year", var_name="Month", value_name="Price")
    .assign(month_num= lambda x: pd.to_datetime(x["Month"] , format="%b").dt.month)
    .sort_values(by=["Year", "month_num"])
    .assign(Date= lambda x: x.pop("Month") + "-" + x.pop("Year").astype(str))
    .loc[:, ["Date", "Price"]]
)
# Output :
print(out)
Date Price
0 Jan-1997 3.45
4 Feb-1997 2.15
8 Mar-1997 1.89
12 Apr-1997 2.03
16 May-1997 2.25
.. ... ...
7 Feb-2000 2.66
11 Mar-2000 2.79
15 Apr-2000 3.04
19 May-2000 3.59
23 Jun-2000 4.29
[24 rows x 2 columns]
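If the goal is a Date column that sorts chronologically, one option is to parse the month and year into real timestamps instead of strings. This is a sketch that assumes the abbreviated English month names from the question (the %b format is locale-dependent); wide_to_dated is a hypothetical helper name and the two-month frame is made-up sample data:

```python
import pandas as pd


def wide_to_dated(df):
    """Turn a Year x month-name table into chronologically sorted (Date, Price) rows."""
    long_df = df.melt(id_vars="Year", var_name="Month", value_name="Price")
    # "%b-%Y" parses strings such as "Jan-1997" into the first day of that month
    long_df["Date"] = pd.to_datetime(
        long_df["Month"] + "-" + long_df["Year"].astype(str), format="%b-%Y"
    )
    return long_df.sort_values("Date")[["Date", "Price"]].reset_index(drop=True)


df = pd.DataFrame({"Year": [1997, 1998], "Jan": [3.45, 2.09], "Feb": [2.15, 2.23]})
out = wide_to_dated(df)
print(out)
```

For display, out["Date"].dt.strftime("%b-%Y") turns the timestamps back into the Jan-1997 style.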
I am a beginner working with a clinical data set using Pandas in Jupyter Notebook.
A column of my data contains census tract codes and I am trying to merge my data with a large transportation data file that also has a column with census tract codes.
I initially only wanted 2 of the other columns from that transportation file so, after I downloaded the file, I removed all of the other columns except the 2 that I wanted to add to my file and the census tract column.
This is the code I used:
df_my_data = pd.read_excel("my_data.xlsx")
df_transportation_data = pd.read_excel("transportation_data.xlsx")
df_merged_file = pd.merge(df_my_data, df_transportation_data)
df_merged_file.to_excel('my_merged_file.xlsx', index = False)
This worked but then I wanted to add the other columns from the transportation file so I used my initial file (prior to adding the 2 transportation columns) and tried to merge the entire transportation file. This resulted in a new DataFrame with all of the desired columns but only 4 rows.
I thought maybe the transportation file is too big so I tried merging individual columns (other than the 2 I was initially able to merge) and this again results in all of the correct columns but only 4 rows merging.
Any help would be much appreciated.
Edits:
Sorry for not being more clear.
Here is the code for the 2 initial columns I merged:
import pandas as pd
df_my_data = pd.read_excel('my_data.xlsx')
df_two_columns = pd.read_excel('two_columns_from_transportation_file.xlsx')
df_two_columns_merged = pd.merge(df_my_data, df_two_columns, on=['census_tract'])
df_two_columns_merged.to_excel('two_columns_merged.xlsx', index = False)
The outputs were:
df_my_data.head()
census_tract id e t
0 6037408401 1 1 1092
1 6037700200 2 1 1517
2 6065042740 3 1 2796
3 6037231210 4 1 1
4 6059076201 5 1 41
df_two_columns.head()
census_tract households_with_no_vehicle vehicles_per_household
0 6001400100 2.16 2.08
1 6001400200 6.90 1.50
2 6001400300 17.33 1.38
3 6001400400 8.97 1.41
4 6001400500 11.59 1.39
df_two_columns_merged.head()
census_tract id e t households_with_no_vehicle vehicles_per_household
0 6037408401 1 1 1092 4.52 2.43
1 6037700200 2 1 1517 9.88 1.26
2 6065042740 3 1 2796 2.71 1.49
3 6037231210 4 1 1 25.75 1.35
4 6059076201 5 1 41 1.63 2.22
df_my_data has 657 rows and df_two_columns_merged came out with 657 rows.
The code for when I tried to merge the entire transport file:
import pandas as pd
df_my_data = pd.read_excel('my_data.xlsx')
df_transportation_data = pd.read_excel('transportation_data.xlsx')
df_merged_file = pd.merge(df_my_data, df_transportation_data, on=['census_tract'])
df_merged_file.to_excel('my_merged_file.xlsx', index = False)
The output:
df_transportation_data.head()
census_tract Bike Carpooled Drove Alone Households No Vehicle Public Transportation Walk Vehicles per Household
0 6001400100 0.00 12.60 65.95 2.16 20.69 0.76 2.08
1 6001400200 5.68 3.66 45.79 6.90 39.01 5.22 1.50
2 6001400300 7.55 6.61 46.77 17.33 31.19 6.39 1.38
3 6001400400 8.85 11.29 43.91 8.97 27.67 4.33 1.41
4 6001400500 8.45 7.45 46.94 11.59 29.56 4.49 1.39
df_merged_file.head()
census_tract id e t Bike Carpooled Drove Alone Households No Vehicle Public Transportation Walk Vehicles per Household
0 6041119100 18 0 2755 1.71 3.02 82.12 4.78 8.96 3.32 2.10
1 6061023100 74 1 1201 0.00 9.85 86.01 0.50 2.43 1.16 2.22
2 6041110100 80 1 9 0.30 4.40 72.89 6.47 13.15 7.89 1.82
3 6029004902 123 0 1873 0.00 18.38 78.69 4.12 0.00 0.00 2.40
The df_merged_file only has 4 total rows.
So my question is: why is it that I am able to merge those initial 2 columns from the transportation file and keep all of the rows from my file but when I try to merge the entire transportation file I only get 4 rows of output?
I recommend specifying merge type and merge column(s).
When you use pd.merge(), the default merge type is an inner merge, and by default it joins on every column name the two DataFrames have in common. To control both explicitly, use:
df_merged_file = pd.merge(df_my_data, df_transportation_data, how='left', left_on=[COLUMN], right_on=[COLUMN])
It is possible that one of the columns you previously removed from "transportation_data.xlsx" has the same name as a column in your "my_data.xlsx". If no merge column is specified, that column also becomes part of the join key, and the inner merge drops every row that does not match on it.
A 'left' merge keeps every row of your "my_data.xlsx" and attaches the columns from "transportation_data.xlsx" where there is a match, leaving them empty (NaN) elsewhere. As long as each census tract appears only once in the transportation file, your merged DataFrame will have the same number of rows as your "my_data.xlsx" has currently.
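One way to see why rows disappear is to run the merge with how='left' and indicator=True, which labels each row by whether its key was found in the other file. This is a sketch with made-up data; only the census_tract column name comes from the question. The dtype comparison reflects an assumption about one possible cause (tract codes read as text in one file and as integers in the other), not a diagnosis of the asker's files:

```python
import pandas as pd

mine = pd.DataFrame(
    {"census_tract": [6037408401, 6037700200, 6065042740], "id": [1, 2, 3]}
)
transport = pd.DataFrame(
    {"census_tract": [6037408401, 6001400100], "Walk": [0.76, 5.22]}
)

# Keys only match when both columns hold the same type of value
print(mine["census_tract"].dtype, transport["census_tract"].dtype)

# validate="many_to_one" raises if a tract is duplicated in the right-hand frame
checked = pd.merge(
    mine, transport, on="census_tract", how="left",
    indicator=True, validate="many_to_one",
)
unmatched = checked.loc[checked["_merge"] == "left_only", "census_tract"]
print(len(checked), "rows;", len(unmatched), "without a match")
```

The unmatched keys can then be compared by eye against the transportation file to spot formatting or download problems.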
Well, I think there was something wrong with the initial download of the transportation file. I downloaded it again and this time I was able to get a complete merge. Sorry for being an idiot. Thank you all for your help.
I have a pandas dataframe with the following columns:
Stock ROC5 ROC20 ROC63 ROCmean
0 IBGL.SW -0.59 3.55 6.57 3.18
0 EHYA.SW 0.98 4.00 6.98 3.99
0 HIGH.SW 0.94 4.22 7.18 4.11
0 IHYG.SW 0.56 2.46 6.16 3.06
0 HYGU.SW 1.12 4.56 7.82 4.50
0 IBCI.SW 0.64 3.57 6.04 3.42
0 IAEX.SW 8.34 18.49 14.95 13.93
0 AGED.SW 9.45 24.74 28.13 20.77
0 ISAG.SW 7.97 21.61 34.34 21.31
0 IAPD.SW 0.51 6.62 19.54 8.89
0 IASP.SW 1.08 2.54 12.18 5.27
0 RBOT.SW 10.35 30.53 39.15 26.68
0 RBOD.SW 11.33 30.50 39.69 27.17
0 BRIC.SW 7.24 11.08 75.60 31.31
0 CNYB.SW 1.14 4.78 8.36 4.76
0 FXC.SW 5.68 13.84 19.29 12.94
0 DJSXE.SW 3.11 9.24 6.44 6.26
0 CSSX5E.SW -0.53 5.29 11.85 5.54
How can I add a new column "Symbol" to the dataframe that contains the stock name without ".SW"?
For example, the first row should become IBGL (from IBGL.SW).
The last row should become CSSX5E (from CSSX5E.SW).
If I send the following command:
new_df['Symbol'] = new_df.loc[:, ('Stock')].str.split('.').str[0]
Then I receive an error message:
:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
new_df['Symbol'] = new_df.loc[:, ('Stock')].str.split('.').str[0]
How can I solve this problem?
Thanks a lot for your support.
METHOD 1:
You can do a vectorized operation by str.get(0) -
df['SYMBOL'] = df['Stock'].str.split('.').str.get(0)
METHOD 2:
You can do another vectorized operation by using expand=True in str.split() and then getting the first column.
df['SYMBOL'] = df['Stock'].str.split('.', expand = True)[0]
METHOD 3:
Or you can write a custom lambda function with apply (for more complex processes). Note, this is slower but good if you have your own UDF.
df['SYMBOL'] = df['Stock'].apply(lambda x:x.split('.')[0])
This is not an error but a warning; as you have probably noticed, your script still finishes its execution.
Edit: Given your comments, it seems the issue originates earlier in your code, so I suggest you use the following:
new_df = new_df.copy(deep=False)
And then proceed to solve it with:
new_df.loc[:, 'Symbol'] = new_df['Stock'].str.split('.').str[0]
new_df = new_df.copy()
new_df['Symbol'] = new_df.Stock.str.replace('.SW', '', regex=False)
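The warning usually means new_df was itself created by slicing another DataFrame. The question does not show how new_df was built, so the parent frame df and the filter below are assumptions for illustration. The sketch takes an explicit copy at the point of slicing and then assigns the column:

```python
import pandas as pd

df = pd.DataFrame(
    {"Stock": ["IBGL.SW", "CSSX5E.SW", "FXC.SW"], "ROC5": [-0.59, -0.53, 5.68]}
)

# .copy() makes new_df an independent frame, so the assignment below is unambiguous
new_df = df[df["ROC5"] < 1].copy()

# split on the dot and keep the part before it
new_df["Symbol"] = new_df["Stock"].str.split(".").str[0]
print(new_df)
```

The original df is left untouched; only new_df gains the Symbol column.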
I have a dataframe like this:
Code Date Open High Low Close Volume VWAP TWAP
0 US_GWA_BTC 2014-04-01 467.28 488.62 467.28 479.56 74,776.48 482.76 482.82
1 GWA_BTC 2014-04-02 479.20 494.30 431.32 437.08 114,052.96 460.19 465.93
2 GWA_BTC 2014-04-03 437.33 449.74 414.41 445.60 91,415.08 432.29 433.28
.
316 MWA_XRP_US 2018-01-19 1.57 1.69 1.48 1.53 242,563,870.44 1.59 1.59
317 MWA_XRP_US 2018-01-20 1.54 1.62 1.49 1.57 140,459,727.30 1.56 1.56
I want to keep only the rows where Code starts with GWA.
I tried this code, but it is not working:
df.set_index("Code").filter(regex='[GWA_]*', axis=0)
Try using startswith:
df[df.Code.str.startswith('GWA')]
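For illustration, here is the same idea on a few made-up rows shaped like the question's data. The mask keeps codes that begin with GWA, and negating it with ~ drops them instead, in case "filter out" meant exclusion:

```python
import pandas as pd

df = pd.DataFrame({
    "Code": ["US_GWA_BTC", "GWA_BTC", "GWA_BTC", "MWA_XRP_US"],
    "Close": [479.56, 437.08, 445.60, 1.53],
})

starts_with_gwa = df["Code"].str.startswith("GWA")
kept = df[starts_with_gwa]       # only codes beginning with GWA
dropped = df[~starts_with_gwa]   # everything else
print(kept)
```

The original pattern '[GWA_]*' is a character class repeated zero or more times, so it matches every label and filters nothing.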
To read files from a directory, I tried the following:
import os
import pandas as pd
path=os.getcwd()
files=os.listdir(path)
files
['wind-diciembre.xls', 'stat_noviembre.xls', 'stat_marzo.xls', 'wind-noviembre.xls', 'wind-enero.xls', 'stat_octubre.xls', 'wind-septiembre.xls', 'stat_septiembre.xls', 'wind-febrero.xls', 'wind-marzo.xls', 'wind-julio.xls', 'wind-octubre.xls', 'stat_diciembre.xls', 'stat_julio.xls', 'wind-junio.xls', 'stat_abril.xls', 'stat_enero.xls', 'stat_junio.xls', 'stat_agosto.xls', 'stat_febrero.xls', 'wind-abril.xls', 'wind-agosto.xls']
where:
stat_enero
Fecha HR PreciAcu RadSolar T Presion Tmax HRmax \
01/01/2011 37 0 162 18.5 0 31.2 86
02/01/2011 70 0 58 12.0 0 14.6 95
03/01/2011 62 0 188 15.3 0 24.9 86
04/01/2011 69 0 181 17.0 0 29.2 97
.
.
.
Presionmax RadSolarmax Tmin HRmin Presionmin
0 0 774 12.3 9 0
1 0 314 9.2 52 0
2 0 713 8.3 32 0
3 0 730 7.7 26 0
.
.
.
and
wind-enero
Fecha MagV MagMax Rachas MagRes DirRes DirWind
01/08/2011 00:00 4.3 14.1 17.9 1.0 281.3 ONO
02/08/2011 00:00 4.2 15.7 20.6 1.5 28.3 NNE
03/08/2011 00:00 4.6 23.3 25.6 2.9 49.2 ENE
04/08/2011 00:00 4.8 17.9 23.0 2.0 30.5 NNE
.
.
.
The next step is to read, parse and add the files to a dataframe, Now I do the following:
for f in files:
    data=pd.ExcelFile(f)
    data1=data.sheet_names
    print data1
[u'diciembre']
[u'Hoja1']
[u'Hoja1']
[u'noviembre']
[u'enero']
[u'Hoja1']
[u'septiembre']
[u'Hoja1']
[u'febrero']
[u'marzo']
[u'julio']
.
.
.
for sheet in data1:
    data2=data.parse(sheet)
data2
Fecha MagV MagMax Rachas MagRes DirRes DirWind
01/08/2011 00:00 4.3 14.1 17.9 1.0 281.3 ONO
02/08/2011 00:00 4.2 15.7 20.6 1.5 28.3 NNE
03/08/2011 00:00 4.6 23.3 25.6 2.9 49.2 ENE
04/08/2011 00:00 4.8 17.9 23.0 2.0 30.5 NNE
05/08/2011 00:00 6.0 22.5 26.3 4.4 68.7 ENE
06/08/2011 00:00 4.9 23.8 23.0 3.3 57.3 ENE
07/08/2011 00:00 3.4 12.9 20.2 1.6 104.0 ESE
08/08/2011 00:00 4.0 20.5 22.4 2.6 79.1 ENE
09/08/2011 00:00 4.1 22.4 25.8 2.9 74.1 ENE
10/08/2011 00:00 4.6 18.4 24.0 2.3 52.1 ENE
11/08/2011 00:00 5.0 22.3 27.8 3.3 65.0 ENE
12/08/2011 00:00 5.4 24.9 25.6 4.1 78.7 ENE
13/08/2011 00:00 5.3 26.0 31.7 4.5 79.7 ENE
14/08/2011 00:00 5.9 31.7 29.2 4.5 59.5 ENE
15/08/2011 00:00 6.3 23.0 25.1 4.6 70.8 ENE
16/08/2011 00:00 6.3 19.5 30.8 4.8 64.0 ENE
17/08/2011 00:00 5.2 21.2 25.3 3.9 57.5 ENE
18/08/2011 00:00 5.0 22.3 23.7 2.6 59.4 ENE
19/08/2011 00:00 4.4 21.6 27.5 2.4 57.0 ENE
The above output shows only part of one file. How can I parse all the files and add them to a dataframe?
First off, it appears you have a few different datasets in these files. You may want them all in one dataframe, but for now I am going to assume you want them separated (for example, all of the wind*.xls files in one dataframe and all of the stat*.xls files in another). You could parse the data using read_excel and then concatenate the results, using the timestamp as the index, as follows:
import glob, os
import pandas as pd

runDir = "Path to files"
if os.getcwd() != runDir:
    os.chdir(runDir)

files = glob.glob("wind*.xls")

# DataFrame.append was removed in pandas 2.0, so collect the sheets and concatenate once
frames = []
for each in files:
    sheets = pd.ExcelFile(each).sheet_names
    for sheet in sheets:
        frames.append(pd.read_excel(each, sheet_name=sheet, index_col='Fecha'))
df = pd.concat(frames)
You now have a time-indexed dataframe! If you really want to have all of the data in one dataframe (from all of the file types), you can just adjust the glob to include all of the files using something like glob.glob('*.xls'). I would warn from personal experience that it may be easier for you to read in each type of data separately and then merge them after you have done some error checking/munging etc.
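A sketch of that "read separately, then merge" approach, using two tiny hand-made frames in place of the concatenated wind and stat data. The column subsets, the dates, and the choice of an outer join are assumptions for illustration; indicator=True shows which dates appear in only one source:

```python
import pandas as pd

wind = pd.DataFrame(
    {"MagV": [4.3, 4.2]},
    index=pd.to_datetime(["2011-01-01", "2011-01-02"]).rename("Fecha"),
)
stat = pd.DataFrame(
    {"HR": [37, 70, 62]},
    index=pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-03"]).rename("Fecha"),
)

# Join on the shared date index; an outer join keeps dates missing from either side
combined = pd.merge(
    wind, stat, left_index=True, right_index=True, how="outer", indicator=True
)
print(combined)
```

Rows marked left_only or right_only in the _merge column are the dates worth checking before any analysis.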
The solution below is just a minor tweak on @DavidHagan's answer above.
It adds a column identifying the file each row was read from (F0, F1, etc.)
and another for the sheet within that file (S0, S1, etc.),
so that we know where the rows came from.
import glob, os
import pandas as pd

runDir = r'c:\blah\blah'
if os.getcwd() != runDir:
    os.chdir(runDir)

files = glob.glob(r'*.*xls*')

frames = []  # one dataframe per sheet, concatenated after the loops
# fno is 0, 1, 2, ... (for each file)
for fno, each in enumerate(files):
    sheets = pd.ExcelFile(each).sheet_names
    # sno is 0, 1, 2, ... (for each sheet)
    for sno, sheet in enumerate(sheets):
        FileNo = 'F' + str(fno)   # F0, F1, F2, etc.
        SheetNo = 'S' + str(sno)  # S0, S1, S2, etc.
        # print(FileNo, SheetNo, each, sheet)  # debug info
        # header=None treats the first row as data; take it out to use the first row as column names
        # dfxl is the dataframe of each xl sheet
        dfxl = pd.read_excel(each, sheet_name=sheet, header=None)
        # add columns of FileNo and SheetNo to the dataframe
        dfxl['FileNo'] = FileNo
        dfxl['SheetNo'] = SheetNo
        # collect the current xl sheet
        frames.append(dfxl)
# DataFrame.append was removed in pandas 2.0; pd.concat builds the main dataframe
df = pd.concat(frames)
After the above, i.e. reading multiple Excel files and sheets into a single dataframe (df), you can do the following to get a sample row from each file/sheet combination. The samples will be available in the dataframe dfs1.
# rows with index 0 (the first row of every sheet) give the unique FileNo/SheetNo pairs
dft2 = df.loc[0, ['FileNo', 'SheetNo']]
# list to collect a sample from each of the read files/sheets
samples = []
# loop through each FileNo/SheetNo pair
for row in dft2.itertuples():
    # get a sample from each file/sheet to view (row[1] is FileNo, row[2] is SheetNo)
    dfts = df[(df.FileNo == row[1]) & (df.SheetNo == row[2])].sample(1)
    samples.append(dfts)
# dfs1 has one sample row from each xl sheet and file
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
dfs1 = pd.concat(samples, ignore_index=True)
dfs1.to_clipboard()
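On pandas 1.1 or later, the same per-sheet sample can probably be taken without an explicit loop by grouping on the two label columns. This is a sketch using a small made-up stand-in for the combined dataframe; random_state is set only to make the draw repeatable:

```python
import pandas as pd

df = pd.DataFrame({
    "value": [10, 11, 20, 21, 30],
    "FileNo": ["F0", "F0", "F0", "F0", "F1"],
    "SheetNo": ["S0", "S0", "S1", "S1", "S0"],
})

# One random row from every (FileNo, SheetNo) group
dfs1 = (
    df.groupby(["FileNo", "SheetNo"])
      .sample(n=1, random_state=0)
      .reset_index(drop=True)
)
print(dfs1)
```

Unlike the loop, this does not rely on every sheet having a row with index 0.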