Read multiple Excel files with different sheet names in pandas - Python

To read files from a directory, try the following:
import os
import pandas as pd
path = os.getcwd()
files = os.listdir(path)
files
['wind-diciembre.xls', 'stat_noviembre.xls', 'stat_marzo.xls', 'wind-noviembre.xls', 'wind-enero.xls', 'stat_octubre.xls', 'wind-septiembre.xls', 'stat_septiembre.xls', 'wind-febrero.xls', 'wind-marzo.xls', 'wind-julio.xls', 'wind-octubre.xls', 'stat_diciembre.xls', 'stat_julio.xls', 'wind-junio.xls', 'stat_abril.xls', 'stat_enero.xls', 'stat_junio.xls', 'stat_agosto.xls', 'stat_febrero.xls', 'wind-abril.xls', 'wind-agosto.xls']
where:
stat_enero
Fecha HR PreciAcu RadSolar T Presion Tmax HRmax Presionmax RadSolarmax Tmin HRmin Presionmin
01/01/2011 37 0 162 18.5 0 31.2 86 0 774 12.3 9 0
02/01/2011 70 0 58 12.0 0 14.6 95 0 314 9.2 52 0
03/01/2011 62 0 188 15.3 0 24.9 86 0 713 8.3 32 0
04/01/2011 69 0 181 17.0 0 29.2 97 0 730 7.7 26 0
.
.
.
and
wind-enero
Fecha MagV MagMax Rachas MagRes DirRes DirWind
01/08/2011 00:00 4.3 14.1 17.9 1.0 281.3 ONO
02/08/2011 00:00 4.2 15.7 20.6 1.5 28.3 NNE
03/08/2011 00:00 4.6 23.3 25.6 2.9 49.2 ENE
04/08/2011 00:00 4.8 17.9 23.0 2.0 30.5 NNE
.
.
.
The next step is to read and parse the files and add them to a dataframe. Currently I do the following:
for f in files:
    data = pd.ExcelFile(f)
    data1 = data.sheet_names
    print(data1)
['diciembre']
['Hoja1']
['Hoja1']
['noviembre']
['enero']
['Hoja1']
['septiembre']
['Hoja1']
['febrero']
['marzo']
['julio']
.
.
.
for sheet in data1:
    data2 = data.parse(sheet)
data2
Fecha MagV MagMax Rachas MagRes DirRes DirWind
01/08/2011 00:00 4.3 14.1 17.9 1.0 281.3 ONO
02/08/2011 00:00 4.2 15.7 20.6 1.5 28.3 NNE
03/08/2011 00:00 4.6 23.3 25.6 2.9 49.2 ENE
04/08/2011 00:00 4.8 17.9 23.0 2.0 30.5 NNE
05/08/2011 00:00 6.0 22.5 26.3 4.4 68.7 ENE
06/08/2011 00:00 4.9 23.8 23.0 3.3 57.3 ENE
07/08/2011 00:00 3.4 12.9 20.2 1.6 104.0 ESE
08/08/2011 00:00 4.0 20.5 22.4 2.6 79.1 ENE
09/08/2011 00:00 4.1 22.4 25.8 2.9 74.1 ENE
10/08/2011 00:00 4.6 18.4 24.0 2.3 52.1 ENE
11/08/2011 00:00 5.0 22.3 27.8 3.3 65.0 ENE
12/08/2011 00:00 5.4 24.9 25.6 4.1 78.7 ENE
13/08/2011 00:00 5.3 26.0 31.7 4.5 79.7 ENE
14/08/2011 00:00 5.9 31.7 29.2 4.5 59.5 ENE
15/08/2011 00:00 6.3 23.0 25.1 4.6 70.8 ENE
16/08/2011 00:00 6.3 19.5 30.8 4.8 64.0 ENE
17/08/2011 00:00 5.2 21.2 25.3 3.9 57.5 ENE
18/08/2011 00:00 5.0 22.3 23.7 2.6 59.4 ENE
19/08/2011 00:00 4.4 21.6 27.5 2.4 57.0 ENE
The above output shows only part of one file. How can I parse all of the files and add them to a single dataframe?

First off, it appears you have a few different datasets in these files. You may want them all in one dataframe, but for now I will assume you want them separated, e.g. all of the wind*.xls files in one dataframe and all of the stat*.xls files in another. You could parse the data using read_excel and then concatenate the results, using the timestamp as the index, as follows:
import pandas as pd
import glob, os

runDir = "Path to files"
if os.getcwd() != runDir:
    os.chdir(runDir)

files = glob.glob("wind*.xls")
frames = []
for each in files:
    sheets = pd.ExcelFile(each).sheet_names
    for sheet in sheets:
        frames.append(pd.read_excel(each, sheet_name=sheet, index_col='Fecha'))
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concat once
df = pd.concat(frames)
You now have a time-indexed dataframe! If you really want all of the data (from all of the file types) in one dataframe, you can adjust the glob to include all of the files using something like glob.glob('*.xls'). I would warn from personal experience that it may be easier to read in each type of data separately and then merge them after you have done some error checking/munging; a sketch of that approach follows.
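A minimal sketch of the separate-then-merge route (assuming the stat files also have a 'Fecha' column to index on; the right join type depends on your data):
import pandas as pd
import glob

def read_all(pattern):
    # read every sheet of every file matching the glob pattern
    frames = []
    for path in glob.glob(pattern):
        for sheet in pd.ExcelFile(path).sheet_names:
            frames.append(pd.read_excel(path, sheet_name=sheet, index_col='Fecha'))
    return pd.concat(frames)

wind = read_all('wind*.xls')
stat = read_all('stat*.xls')
# join on the shared date index; 'outer' keeps dates present in only one set
combined = wind.join(stat, how='outer')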

The solution below is just a minor tweak on @DavidHagan's answer above.
This one adds a column identifying the file each row was read from (F0, F1, etc.)
and the sheet within each file (S0, S1, etc.),
so that we know where the rows came from.
import pandas as pd
import glob, os

runDir = r'c:\blah\blah'
if os.getcwd() != runDir:
    os.chdir(runDir)

files = glob.glob(r'*.*xls*')
frames = []
# fno is 0, 1, 2, ... (for each file)
for fno, each in enumerate(files):
    sheets = pd.ExcelFile(each).sheet_names
    # sno is 0, 1, 2, ... (for each sheet)
    for sno, sheet in enumerate(sheets):
        FileNo = 'F' + str(fno)   # F0, F1, F2, etc.
        SheetNo = 'S' + str(sno)  # S0, S1, S2, etc.
        # use header=None if your sheets have no header row; take it out otherwise
        # dfxl is the dataframe of one xl sheet
        dfxl = pd.read_excel(each, sheet_name=sheet, header=None)
        # add the FileNo and SheetNo columns to the dataframe
        dfxl['FileNo'] = FileNo
        dfxl['SheetNo'] = SheetNo
        frames.append(dfxl)
# now combine every xl sheet into the main dataframe
df = pd.concat(frames)
After doing the above, i.e. reading multiple Excel files and sheets into a single dataframe (df), you can do the following to get a sample row from each file/sheet combination; the samples are collected in the dataframe dfs1.
# get the unique FileNo/SheetNo combinations
dft2 = df[['FileNo', 'SheetNo']].drop_duplicates()
# empty list to collect a sample from each of the read files/sheets
samples = []
# loop through each file/sheet combination
for row in dft2.itertuples(index=False):
    # get one random row from this file/sheet to view
    samples.append(df[(df.FileNo == row.FileNo) & (df.SheetNo == row.SheetNo)].sample(1))
# dfs1 has one sample row from each xl sheet and file
dfs1 = pd.concat(samples, ignore_index=True)
dfs1.to_clipboard()
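On recent pandas versions (1.1 or newer) the same per-file/per-sheet sampling can be written in one step with groupby; a sketch using the FileNo and SheetNo columns created above:
# one random row per (FileNo, SheetNo) combination
dfs1 = df.groupby(['FileNo', 'SheetNo']).sample(1).reset_index(drop=True)
dfs1.to_clipboard()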

Related

Combine a row with a column in a dataframe and show the corresponding values

So I want to show this data in just two columns. For example, I want to turn this data
Year Jan Feb Mar Apr May Jun
1997 3.45 2.15 1.89 2.03 2.25 2.20
1998 2.09 2.23 2.24 2.43 2.14 2.17
1999 1.85 1.77 1.79 2.15 2.26 2.30
2000 2.42 2.66 2.79 3.04 3.59 4.29
into this
Date Price
Jan-1997 3.45
Feb-1997 2.15
Mar-1997 1.89
Apr-1997 2.03
....
Jan-2000 2.42
Feb-2000 2.66
So far, I have read about combining two columns into another dataframe using .apply() and .agg(), but I found no info on how to combine them as shown above.
import pandas as pd
df = pd.read_csv('matrix-A.csv', index_col =0 )
matrix_b = ({})
new = pd.DataFrame(matrix_b)
new["Date"] = df['Year'].astype(float) + "-" + df["Dec"]
print(new)
I have tried this way, but of course it does not work. I have also tried using pd.Series(), but with no success.
Is there any site where I can learn how to do this, or does anybody know the correct way to solve it?
Another possible solution, which is based on pandas.DataFrame.stack:
out = df.set_index('Year').stack()
out.index = ['{}_{}'.format(j, i) for i, j in out.index]
out = out.reset_index()
out.columns = ['Date', 'Value']
Output:
Date Value
0 Jan_1997 3.45
1 Feb_1997 2.15
2 Mar_1997 1.89
3 Apr_1997 2.03
4 May_1997 2.25
....
19 Feb_2000 2.66
20 Mar_2000 2.79
21 Apr_2000 3.04
22 May_2000 3.59
23 Jun_2000 4.29
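Note that this output uses Jan_1997 style labels; to get the Jan-1997 format shown in the question, simply change the format string in the second line of the solution:
# dash instead of underscore to match the desired 'Mon-Year' labels
out.index = ['{}-{}'.format(j, i) for i, j in out.index]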
You can first convert it to long-form using melt. Then, create a new column for Date by combining two columns.
long_df = pd.melt(df, id_vars=['Year'], var_name='Month', value_name="Price")
long_df['Date'] = long_df['Month'] + "-" + long_df['Year'].astype('str')
long_df[['Date', 'Price']]
If you want to sort your date column, here is a good resource. Follow those instructions after melting and before creating the Date column.
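For instance, one way to do that sorting (a sketch, continuing from the melt above): parse the month abbreviation together with the year into a real datetime, sort on it, then build the Date label.
long_df = pd.melt(df, id_vars=['Year'], var_name='Month', value_name='Price')
# parse 'Jan', 'Feb', ... together with the year into a sortable datetime
long_df['ts'] = pd.to_datetime(long_df['Year'].astype(str) + '-' + long_df['Month'], format='%Y-%b')
long_df = long_df.sort_values('ts')
long_df['Date'] = long_df['Month'] + '-' + long_df['Year'].astype(str)
print(long_df[['Date', 'Price']])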
You can use pandas.DataFrame.melt :
out = (
    df
    .melt(id_vars="Year", var_name="Month", value_name="Price")
    .assign(month_num=lambda x: pd.to_datetime(x["Month"], format="%b").dt.month)
    .sort_values(by=["Year", "month_num"])
    .assign(Date=lambda x: x.pop("Month") + "-" + x.pop("Year").astype(str))
    .loc[:, ["Date", "Price"]]
)
print(out)
Output:
Date Price
0 Jan-1997 3.45
4 Feb-1997 2.15
8 Mar-1997 1.89
12 Apr-1997 2.03
16 May-1997 2.25
.. ... ...
7 Feb-2000 2.66
11 Mar-2000 2.79
15 Apr-2000 3.04
19 May-2000 3.59
23 Jun-2000 4.29
[24 rows x 2 columns]

calculating the minimum, mean, and maximum values of the expanding window in a time series dataset

I found the following code for my task, where I need to compute the mean, min, and max of a time series dataframe up to each time step:
for instance, the value at time step 10 should include all the information from time steps 0 through 10.
The following code seems to work for a series of data; I was wondering if there is a pythonic way to do the same for a dataframe.
from pandas import read_csv
from pandas import DataFrame
from pandas import concat

series = read_csv('daily-min-temperatures.csv', header=0, index_col=0)
temps = DataFrame(series.values)
window = temps.expanding()
dataframe = concat([window.min(), window.mean(), window.max(), temps.shift(-1)], axis=1)
dataframe.columns = ['min', 'mean', 'max', 't+1']
print(dataframe.head(5))
IIUC:
import pandas as pd
df = pd.read_csv('daily-min-temperatures.csv', header=0, index_col=0)
out = df['Temp'].expanding().agg(['min', 'mean', 'max']) \
        .assign(**{'t+1': df['Temp'].shift(-1)})
Output:
>>> out
min mean max t+1
Date
1981-01-01 20.7 20.700000 20.7 17.9
1981-01-02 17.9 19.300000 20.7 18.8
1981-01-03 17.9 19.133333 20.7 14.6
1981-01-04 14.6 18.000000 20.7 15.8
1981-01-05 14.6 17.560000 20.7 15.8
... ... ... ... ...
1990-12-27 0.0 11.174712 26.3 13.6
1990-12-28 0.0 11.175377 26.3 13.5
1990-12-29 0.0 11.176014 26.3 15.7
1990-12-30 0.0 11.177254 26.3 13.0
1990-12-31 0.0 11.177753 26.3 NaN
[3650 rows x 4 columns]
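If your dataframe has several value columns, the same expanding/agg pattern works on the whole frame at once; a small sketch with a made-up second column (note that the result's columns become a MultiIndex):
import pandas as pd

# hypothetical frame with two series, just to show the shape of the result
df = pd.DataFrame({'Temp': [20.7, 17.9, 18.8, 14.6],
                   'Rain': [0.0, 1.2, 0.4, 0.0]})
out = df.expanding().agg(['min', 'mean', 'max'])
print(out)  # columns: (Temp, min), (Temp, mean), ..., (Rain, max)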

how to read url .txt files using pandas

I have a problem reading files using pandas (read_csv). I can do it using the built-in with open(...), but it is much easier with pandas. I just need to read the data (numbers) between the ----. This is the LINK for one of my data urls; there are more, depending on the date that I insert. A sample of the data:
MONTHLY CLIMATOLOGICAL SUMMARY for JUN. 2020
NAME: Krieza Evias CITY: Krieza Evias STATE:
ELEV: 119 m LAT: 38° 24' 00" N LONG: 24° 18' 00" E
TEMPERATURE (°C), RAIN (mm), WIND SPEED (km/hr)
HEAT COOL AVG
MEAN DEG DEG WIND DOM
DAY TEMP HIGH TIME LOW TIME DAYS DAYS RAIN SPEED HIGH TIME DIR
------------------------------------------------------------------------------------
1 18.2 22.4 10:20 13.5 23:50 1.0 0.9 0.0 4.5 33.8 12:30 E
2 17.6 22.3 15:00 10.8 4:10 2.0 1.3 0.0 4.5 30.6 15:20 E
3 18.1 21.9 12:20 14.1 3:40 1.3 1.1 1.0 4.2 24.1 14:40 E
Keep in mind that I cannot just use skiprows=8 and skipfooter=9 to get the data between the --------, because not all files of this format have a fixed number of header or footer lines to skip. Some have 2 or 3 and others have 8-9 lines of footer or title to skip. But every file has two lines of -------- with the data between them.
I think you can't directly use read_csv, but you could do this:
import urllib.request
import pandas as pd
from io import StringIO

count = 0
txt = ""
data = urllib.request.urlopen(LINK)  # LINK is the data url from the question
for line in data:
    line = line.decode('windows-1252')
    if "---" in line:
        count += 1
        # stop once the second dashed line is reached; the rest is footer
        if count == 2:
            break
    elif count == 1:
        # between the two dashed lines: collect the data rows
        txt += line
df = pd.read_csv(StringIO(txt), sep=r"\s+", header=None)
header is None because in your link the column names are not in a single row but are divided across multiple rows. If they are fixed, I suggest setting them by hand, e.g. ["DAY", "MEAN TEMP", ...].
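For example, judging from the header block in the sample, the names could be set roughly like this (a guess at the intended labels; adjust them to your file):
df.columns = ['DAY', 'MEAN TEMP', 'HIGH', 'HIGH TIME', 'LOW', 'LOW TIME',
              'HEAT DEG DAYS', 'COOL DEG DAYS', 'RAIN',
              'AVG WIND SPEED', 'WIND HIGH', 'WIND HIGH TIME', 'DOM DIR']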

Loop using to_csv only printing last file (Python)

I am trying to load a number of csv files from a folder, run a function that calculates missing values on each file, and then save new csv files containing the output. When I edit the script to print the output, I get the expected result. However, the loop only ever saves the last file to the directory. The code I am using is:
from pathlib import Path
import pandas as pd
import os
from glob import glob

files = glob("C:/Users/61437/Desktop/test_folder/*.csv")  # get all csv's from folder
n = 0
for file in files:
    print(file)
    df = pd.read_csv(file, index_col=False)
    d = calc_missing_prices(df)  # calc_missing_prices is a user-defined function
    print(d)
    d.to_csv(r'C:\Users\61437\Desktop\test_folder\derived_files\derived_{}.csv'.format(n+1), index=False)
The print() command returns the expected output, which for my data is:
C:/Users/61437/Desktop/test_folder\file1.csv
V_150 V_200 V_300 V_375 V_500 V_750 V_1000
0 3.00 2.75 4.50 6.03 8.35 12.07 15.00
1 2.32 3.09 4.63 5.00 9.75 12.50 12.25
2 1.85 2.47 3.70 4.62 6.17 9.25 12.33
3 1.75 2.00 4.06 6.50 6.78 10.16 15.20
C:/Users/61437/Desktop/test_folder\file2.csv
V_300 V_375 V_500 V_750 V_1000
0 4.00 4.50 6.06 9.08 11.00
1 3.77 5.00 6.50 8.50 12.56
2 3.00 3.66 4.88 7.31 9.50
C:/Users/61437/Desktop/test_folder\file3.csv
V_500 V_750 V_1000
0 5.50 8.25 11.00
1 6.50 8.50 12.17
2 4.75 7.12 9.50
However, the only saved csv file is 'derived_1.csv', which contains the output from file3.csv.
What am I doing that prevents all three files from being created?
You are not incrementing n inside the loop, so your data is stored in the file derived_1.csv, which is overwritten on every iteration; once the for loop finishes executing, only the last csv remains.
Include the line n += 1 inside the for loop to increment it by 1 on every iteration.
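Alternatively, enumerate gives you the counter for free, so there is nothing to forget; a sketch of just the loop (calc_missing_prices is the user-defined function from the question):
for n, file in enumerate(files, start=1):
    df = pd.read_csv(file, index_col=False)
    d = calc_missing_prices(df)
    # n counts 1, 2, 3, ... so each input gets its own output file
    d.to_csv(r'C:\Users\61437\Desktop\test_folder\derived_files\derived_{}.csv'.format(n), index=False)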

How to make separator in pandas read_csv accept a defined range of spaces as a separator

This question is similar to How to make separator in pandas read_csv more flexible wrt whitespace, for irregular separators?
I have a text file in this format
year jan feb mar apr may jun jul aug sep oct nov dec win spr sum aut ann
2017 0.2 3.6 5.0 4.2 8.8 12.2 12.9 11.7 9.7 9.2 3.5 1.8 2.01 6.01 12.27 7.48 6.92
2018 2.4 -0.5 1.9 6.6 7.9 10.8 13.5 12.8 9.6 7.2 5.2 3.8 1.32 5.43 12.36 7.33 6.80
2019 0.9 1.8 4.4 3.6 6.5 10.8 13.3 12.6 10.0 7.2 3.6 2.9 2.22 4.85 12.25 6.90 6.49
2020 3.8 3.3 2.8 4.8 6.9 3.31 4.81
The text file has an irregular number of spaces (3-4) between columns, and I do not need the columns ['win','spr','sum','aut','ann']
Firstly, to handle the irregular spaces I used this:
parse_column = ['year']
weather_data = pd.read_csv(StringIO(postString),delimiter=r'\s+',parse_dates=parse_column, engine='python')
However, this collapsed the values for 'win' and 'spr' into 'jun' and 'jul'.
Next I tried
parse_column = ['year']
weather_data = pd.read_csv(StringIO(postString),delimiter=r'\s[0-4]',parse_dates=parse_column, engine='python')
But this results in
ValueError: 'year' is not in list
Finally I tried to remove the unnecessary columns as part of the import like this:
parse_column = ['year']
weather_data = pd.read_csv(StringIO(postString),delimiter=r'\s+',parse_dates=parse_column, engine='python',usecols=['year','jan','feb','mar','apr','may','jun','jul','aug','sep','oct', 'nov','dec'])
This however produces the same result as the first attempt.
I'm hoping there's a relatively simple regex that I'm missing, but variations on r'\s[01-5]' either exclude the 'year' column or return error messages such as 'x columns expected, y found'.
I'm trying to avoid having to remove these incorrectly parsed values after loading as there are so many variations of erring data as we move through the year.
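One approach that sidesteps the separator problem entirely: since the file is column-aligned, pandas.read_fwf can infer fixed-width field boundaries, so the short 2020 row comes back with NaN in the missing months instead of values shifting left. A sketch, assuming the same StringIO input used above:
import pandas as pd
from io import StringIO

# read_fwf infers the column boundaries from the alignment of the text
weather_data = pd.read_fwf(StringIO(postString), parse_dates=['year'])
# keep only the year and monthly columns
weather_data = weather_data[['year', 'jan', 'feb', 'mar', 'apr', 'may', 'jun',
                             'jul', 'aug', 'sep', 'oct', 'nov', 'dec']]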
