i have a "database in the form of an excel file" used to generate reports, i want to build an app to generate the report using data exchange with this database, on windows, i'm trying to learn pandas through this,
1- am i wasting my time and should be using another approach instead
2- i need to create a data frame automatically, so i'm trying to extract column names for the data frame from my excel table
3- i managed to store the first row " the column names" in a list, how can i create a dataframe with heders using this list.
import pandas as pd
import os
import xlrd
cwd = os.getcwd()
book = xlrd.open_workbook('LISTERC.xlsx')
sheet = book.sheet_by_name("Sheet1")
count = 0
hdr = []
for c in range(10):
    if sheet.cell(0, c).value != '':
        count += 1
        hdr.append(sheet.cell(0, c).value)
DF = pd.DataFrame(*zip(hdr))
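For step 3, pandas can build an empty DataFrame directly from a list of column names, and it can also read the header row for you, which avoids xlrd entirely (recent xlrd releases dropped .xlsx support, so pandas normally reads .xlsx through openpyxl). A minimal sketch, reusing the file and sheet names from the question:
import pandas as pd
# empty DataFrame whose columns come from the header list built above
hdr = ['ColA', 'ColB', 'ColC']   # placeholder values; use the list read from the sheet
df = pd.DataFrame(columns=hdr)
# or let pandas read the header row (row 0) directly
df = pd.read_excel('LISTERC.xlsx', sheet_name='Sheet1', header=0)
print(list(df.columns))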
I am trying to process an Excel file with pandas. The filter to be applied is by the values of the "Test Code" column, which has the format "XX10.X/XX12.X" (e.g. EF10.1). The problem is that the dataframe neglects everything after the dot when reading the column, leaving just "XX10". The information after the dot is the most important part.
The original document classifies those cells as a date, which probably is altering the normal processing of the values.
The code I am using is:
import os
import pandas as pd
file = "H2020_TRI-HP_T6.2_PropaneIceFaultTests_v1"
folder = "J:\Downloads"
file_path = os.path.join(folder,file+".xlsx")
df = pd.read_excel(file_path,sheet_name="NF10")
df["Test Code"]
The output shows only the truncated codes, e.g. "EF10" instead of "EF10.1".
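One thing worth trying (a sketch, not a confirmed fix for this particular workbook) is forcing pandas to read the column as text rather than letting it infer a type; read_excel accepts a dtype mapping for that:
import os
import pandas as pd
folder = r"J:\Downloads"
file_path = os.path.join(folder, "H2020_TRI-HP_T6.2_PropaneIceFaultTests_v1.xlsx")
# read the "Test Code" column as plain text instead of letting pandas infer a type
df = pd.read_excel(file_path, sheet_name="NF10", dtype={"Test Code": str})
print(df["Test Code"].head())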
I have code that downloads data from Yahoo Finance into a df for a list of stocks. Then I create a new spreadsheet for each stock, but I cannot manage to copy the data from the df into that spreadsheet.
n = number_of_stocks
m = 0
while n > 0:
    x = Input_Stock_Names[m]
    m += 1
    n -= 1
    df = pdr.get_data_yahoo(x, starting_date, ending_date)
    df = df.reset_index()
    ExcelWrksht = ExcelWrkbook.Worksheets.Add()
    ExcelWrksht.Name = x
    ExcelWrksht = ExcelWrkbook.Worksheets(x)
Also, the Excel file is open while the code is running.
If the task is to simply store downloaded data, then using win32com is over-engineering. Simply use the facilities within pandas to write to the Excel file in the .xlsx format:
import pandas as pd
import yfinance as yf
from datetime import date,timedelta
#90 days of history from today
end_date = date.today()
start_date = end_date - timedelta(days=90)
Input_Stock_Names = ['AAPL','TSLA','MSFT']
with pd.ExcelWriter('c:\\somepath\\stocks.xlsx', mode='w') as ew:
    for stock in Input_Stock_Names:
        df = yf.download(stock, start=start_date, end=end_date)
        df.to_excel(ew, sheet_name=stock)
This will create a new Excel file, with one sheet for each stock.
win32com allows you to 'drive' Excel, and do pretty much everything you would do if you had the Excel application open. However, it is relatively slow: the Excel application has to be started (and closed) and all the data and commands have to cross the 'process boundary' from one process (python) to the other (Excel).
Using ExcelWriter you simply write data to a file in the .xlsx format, so that Excel can read it later. If all you want to do is store data, this is much more efficient than using win32com.
For the past few days I've been trying to do a relatively simple task but I'd always encounter some errors so I'd really appreciate some help on this. Here goes:
I have an Excel file which contains a specific column (Column F) that has a list of IDs.
What I want to do is for the program to read this excel file and allow the user to input any of the IDs they would like.
When the user types in one of the IDs, I want the program to return all the IDs that contain the text the user entered, and then export those IDs to a new, separate Excel file where they are displayed in one column, one ID per row.
Here's my code so far, I've tried using arrays and stuff but nothing seems to be working for me :/
import pandas as pd
import numpy as np
import re
import xlrd
import os.path
import xlsxwriter
import openpyxl as xl;
from pandas import ExcelWriter
from openpyxl import load_workbook
# LOAD EXCEL TO DATAFRAME
xls = pd.ExcelFile('N:/TEST/TEST UTILIZATION/IA 2020/Dev/SCS-FT-IE-Report.xlsm')
df = pd.read_excel(xls, 'FT')
# GET USER INPUT (USE AD1852 AS EXAMPLE)
value = input("Enter a Part ID:\n")
print(f'You entered {value}\n\n')
i = 0
x = df.loc[i, "MFG Device"]
df2 = np.array(['', 'MFG Device', 'Loadboard Group','Socket Group', 'ChangeKit Group'])
for i in range(17367):
    # x = df.loc[i, "MFG Device"]
    if value in x:
        df = np.array[x]
        df2.append(df)
    i += 1
print(df2)
# create excel writer object
writer = pd.ExcelWriter('N:/TEST/TEST UTILIZATION/IA 2020/Dev/output.xlsx')
# write dataframe to excel
df2.to_excel(writer)
# save the excel
writer.save()
print('DataFrame is written successfully to Excel File.')
Any help would be appreciated, thanks in advance! :)
It looks like you're doing much more than you need to do. Rather than monkeying around with xlsxwriter, pandas.DataFrame.to_excel is your friend.
Just do
df2.to_excel("output.xlsx")
You don't need xlsxwriter; df.to_excel() alone would work. In your code df2 is a NumPy array. First convert it into a pandas DataFrame in the required shape (index and columns) before writing it to Excel.
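Putting the two suggestions together, the filtering and the export can be done entirely in pandas. A minimal sketch, assuming a substring match on "MFG Device" is what is wanted and using the column names from the question:
import pandas as pd
df = pd.read_excel('N:/TEST/TEST UTILIZATION/IA 2020/Dev/SCS-FT-IE-Report.xlsm', sheet_name='FT')
value = input("Enter a Part ID:\n")
# keep the rows whose "MFG Device" contains the typed text
matches = df[df["MFG Device"].astype(str).str.contains(value, na=False)]
# write the matching rows, one per row, to a new workbook
cols = ["MFG Device", "Loadboard Group", "Socket Group", "ChangeKit Group"]
matches[cols].to_excel('N:/TEST/TEST UTILIZATION/IA 2020/Dev/output.xlsx', index=False)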
There are multiple ways to read excel data into python.
Pandas also provides an API for writing and reading:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
df = pd.read_excel('File.xlsx', sheetname='Sheet1')
That works fine.
BUT: What is the way to access the tables of every sheet directly into a pandas dataframe??
The sheet in question includes a table that is located away from cell (1,1).
Moreover the sheet might include several tables (listobjects in VBA).
I can not find anywhere the way to read them into pandas.
Note 1: It is not possible to modify the workbook to bring all the tables to cell (1,1).
Note 2: I would like to use just pandas (if possible) and minimize the need to import other libraries, but if there is no other way I am ready to use another one. In any case, I could not manage it with xlwings.
Here it looks like it is possible to parse the Excel file, but no solution is provided for tables, just for complete sheets.
The documentation of pandas does not seem to offer that possibility.
Thanks.
You can use xlwings, great package for working with excel files in python.
This is for a single table, but it is pretty trivial to use the xlwings collections (App > books > sheets > tables) to iterate over all tables. Tables are of course ListObjects.
import xlwings
import pandas
with xlwings.App() as App:
    _ = App.books.open('my.xlsx')
    rng = App.books['my.xlsx'].sheets['mysheet'].tables['mytablename'].range
    df: pandas.DataFrame = rng.expand().options(pandas.DataFrame).value
I understand that this question has been marked solved already, but I found an article that provides a much more robust solution:
Full Post
I suppose a newer version of this library supports better visibility of the workbook structure. Here is a summary:
Load the workbook using the load_workbook function from openpyxl.
Then you can access the sheets within it, each of which holds a collection of List Objects (Tables).
Once you have access to the tables, you can get their range addresses.
Finally, loop through the ranges and create a pandas DataFrame from each one.
This is a nicer solution, as it gives us the ability to loop through all the sheets and tables in a workbook.
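A condensed sketch of that approach, using hypothetical file and sheet contents; the relevant openpyxl pieces are worksheet.tables (the ListObjects on a sheet) and Table.ref (the table's range address):
import pandas as pd
from openpyxl import load_workbook
wb = load_workbook('my.xlsx', data_only=True)   # data_only=True returns cached cell values, not formulas
tables = {}
for ws in wb.worksheets:
    for name, table in ws.tables.items():
        cells = ws[table.ref]                   # table.ref is the range address, e.g. "B3:E12"
        rows = [[cell.value for cell in row] for row in cells]
        tables[name] = pd.DataFrame(rows[1:], columns=rows[0])   # first row is the table header
print(tables.keys())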
Here is a way to parse one table, howver it's need you to know some informations on the seet parsed.
df = pd.read_excel("file.xlsx", usecols="B:I", index_col=3)
print(df)
Not elegant, and it works only if a single table is present in the sheet, but it is a first step:
import pandas as pd
import string

letter = list(string.ascii_uppercase)
df1 = pd.read_excel("file.xlsx")

def get_start_column(df):
    # first column that contains any non-empty value
    for i, column in enumerate(df.columns):
        if df[column].first_valid_index():
            return letter[i]

def get_last_column(df):
    columns = df.columns
    len_column = len(columns)
    for i, column in enumerate(columns):
        if df[column].first_valid_index():
            return letter[len_column - i]

def get_first_row(df):
    # first row that is not entirely empty
    for index, row in df.iterrows():
        if not row.isnull().values.all():
            return index + 1

def usecols(df):
    start = get_start_column(df)
    end = get_last_column(df)
    return f"{start}:{end}"

df = pd.read_excel("file.xlsx", usecols=usecols(df1), header=get_first_row(df1))
print(df)
I have a 2 column CSV with download links in the first column and company symbols in the second column. For example:
http://data.com/data001.csv, BHP
http://data.com/data001.csv, TSA
I am trying to loop through the list so that Python opens each CSV via the download link and saves it separately as the company name. Therefore each file should be downloaded and saved as follows:
BHP.csv
TSA.csv
Below is the code I am using. It currently exports the entire CSV in a single-row, tabbed format, then loops back and does it again and again in what looks like an infinite loop.
import pandas as pd
data = pd.read_csv('download_links.csv', names=['download', 'symbol'])
file = pd.DataFrame()
cache = []
for d in data.download:
    df = pd.read_csv(d, index_col=None, header=0)
    cache.append(df)
file = pd.DataFrame(cache)
for s in data.symbol:
    file.to_csv(s + '.csv')
print("done")
Up until I convert the list 'cache' into the DataFrame 'file' to export it, the data is formatted perfectly. It's only when it gets converted to a DataFrame that the trouble starts.
I'd love some help on this one as I've been stuck on it for a few hours.
import pandas as pd
data = pd.read_csv('download_links.csv')
links = data.download
file_names = data.symbol
for link, file_name in zip(links, file_names):
    pd.read_csv(link).to_csv(file_name + '.csv', index=False)
Iterate over both fields in parallel:
for download, symbol in data.itertuples(index=False):
    df = pd.read_csv(download, index_col=None, header=0)
    df.to_csv('{}.csv'.format(symbol))