I am trying to pull data from two Excel files using openpyxl. One file has two columns, employee names and hours worked; the other has two columns, employee names and hourly wage. Ultimately, I'd like the files matched by name, wage multiplied by hours worked, and the result dumped into a third sheet with name and wages payable. At this point, though, I'm struggling just to get the items from two columns in the first sheet into Python so I can manipulate them.
I thought I'd create two lists from the columns, then combine them into a dictionary, but I don't think that will get me where I need to be.
Any suggestions on how to get this data into Python to manipulate it would be fantastic!
import openpyxl

wb = openpyxl.load_workbook("Test_book.xlsx")
sheet = wb["Hours"]  # get_sheet_by_name() is deprecated; index by name instead

employee_names = []
employee_hours = []
for cell in sheet['A']:
    employee_names.append(cell.value)
for cell in sheet['B']:
    employee_hours.append(cell.value)

my_dict = dict(zip(employee_names, employee_hours))
print(my_dict)
A dict comprehension may do it, using zip to iterate over the two columns in parallel. Note that you need .value on each cell, otherwise you store Cell objects as keys and values:

my_dict = {name.value: hours.value for name, hours in zip(sheet['A'], sheet['B'])}

What zip is doing is iterating through the two column sequences in parallel.
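For the full task (matching names across the two workbooks and writing wages payable to a third file), here is a minimal sketch. The file names, sheet layout (names in column A, numbers in column B, header in row 1), and the helper name are assumptions; adjust them to your actual workbooks:

```python
import openpyxl

def wages_payable(hours_path, wages_path, out_path):
    # Assumed layout: column A holds names, column B the numbers,
    # with a header in row 1 of both input files.
    hours_ws = openpyxl.load_workbook(hours_path).active
    wages_ws = openpyxl.load_workbook(wages_path).active

    # Build name -> number lookups, skipping the header row.
    hours = {r[0].value: r[1].value for r in hours_ws.iter_rows(min_row=2)}
    wages = {r[0].value: r[1].value for r in wages_ws.iter_rows(min_row=2)}

    out_wb = openpyxl.Workbook()
    ws = out_wb.active
    ws.append(["Name", "Wages payable"])
    for name, h in hours.items():
        if name in wages:  # only employees present in both files
            ws.append([name, h * wages[name]])
    out_wb.save(out_path)
```

Matching through dictionaries keyed on the name means the two files don't have to list employees in the same order.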
(image: Excel Data)
This is the data I have in an Excel file. There are 10 sheets containing different data, and I want to sort the data in each sheet by the 'BA_Rank' column in descending order.
After sorting the data, I have to write the sorted data to an Excel file.
(For example, the data which was present in sheet1 of the unsorted file should be written to sheet1 of the sorted file, and so on...)
If I remove the heading from the first row, I can use the pandas sort_values() function to sort the data in the first sheet and save it to another file,
like this
import pandas as pd
import xlrd
doc = xlrd.open_workbook('without_sort.xlsx')
xl = pd.read_excel('without_sort.xlsx')
length = doc.nsheets
#print(length)
#for i in range(0,length):
#sheet = xl.parse(i)
result = xl.sort_values('BA_Rank', ascending = False)
result.to_excel('SortedData.xlsx')
print(result)
So is there any way I can sort the data without removing the header from the first row?
And how can I iterate over the sheets so as to sort the data present in multiple sheets?
(Note: All the sheets contain the same columns and I need to sort every sheet using 'BA_Rank' in descending order.)
First input: you don't need to call xlrd when using pandas; it's done under the hood.
Secondly, the read_excel method is REALLY smart. You can (and in my opinion should) define the sheet you're pulling data from. You can also set rows to skip, tell it where the header row is, or ignore the header (and then set column names manually). Check the docs; they're quite extensive.
If this "10 sheets" is merely anecdotal, you could use something like xlrd to extract the workbook's sheet count and work by index (or extract the sheet names directly).
The sorting looks right to me.
Finally, if you want to save it all in the same workbook, I would use openpyxl or some similar library (there are many others, like PyExcelerate for large files).
This procedure pretty much always looks like:
Create/Open destination file (often it's the same method)
Write down data, sheet by sheet
Close/Save file
If the data is to be written all on the same sheet, pd.concat([all_dataframes]).to_excel("path_to_store") should get it done.
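The create/write/save procedure above can be sketched with pandas alone. This is a sketch under assumptions from the question (file names, the 'BA_Rank' column, sheets small enough to fit in memory); sheet_name=None makes read_excel return every sheet at once, headers intact, so nothing has to be removed from row 1:

```python
import pandas as pd

def sort_workbook(src, dst, col="BA_Rank"):
    # sheet_name=None loads every sheet into a dict of {name: DataFrame},
    # with the header row parsed as column names (not as data).
    sheets = pd.read_excel(src, sheet_name=None)
    with pd.ExcelWriter(dst) as writer:
        for name, df in sheets.items():
            # Sort each sheet and write it under its original name.
            df.sort_values(col, ascending=False).to_excel(
                writer, sheet_name=name, index=False)
```

Called as sort_workbook("without_sort.xlsx", "SortedData.xlsx"), this keeps the sheet1-to-sheet1 correspondence the question asks for.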
So here is my situation. Using Python, I want to copy specific columns from an Excel spreadsheet into specific columns of a CSV worksheet.
The pre-filled column header names are different in each spreadsheet, and I need to use a sublist as a parameter.
For example, in the first sublist, data column in excel needs to be copied from/to:
spreadsheet csv
"scan_date" => "date_of_scan"
Two sublists as parameters: one of names copied from excel, one of names of where to paste into csv.
Not sure if a dictionary sublist would be better than two individual sublists?
Also, the csv column header names are in row 2 (not row 1 as in the Excel file), which has complicated things such as building data frames.
So, ideally I would like to have sublists converted to arrays,
spreadsheet iterates columns to find "scan_date"
copies data
iterates to find "date_of_scan" in csv
paste data
moves on to the second item in the sublists and repeats.
I've tried pandas and openpyxl and just can't seem to figure out the approach/syntax of how to do it.
Any help would be greatly appreciated.
Thank you.
Clarification edit:
The csv file has some preexisting data within. Also, I cannot change the headers into different columns. So, if "date_of_scan" is in column "RF" then it must stay in column "RF". I was able to copy, say, the 5 columns of data from excel into a temp spreadsheet and then concatenate into the csv but it always moved the pasted columns to the beginning of the csv document (columns A, B, C, D, E).
It is hard to know the answer without seeing your specific dataset, but it seems to me that a simpler approach might be to simply make your Excel sheet a df, drop everything except the columns you want in the csv, then write a csv with pandas. Here's some pseudo-code.
import pandas as pd

df = pd.read_excel('your_file_name.xlsx')
drop_cols = [...]  # list of columns to get rid of
df = df.drop(drop_cols, axis='columns')  # assign back, drop() returns a copy
col_dict = {'a': 'x', 'b': 'y', 'c': 'z'}  # however you want to map your new columns; in this example abc are old columns and xyz are new ones
# this line will actually rename your columns with the dictionary
df = df.rename(columns=col_dict)
df.to_csv('new_file_name.csv')  # write new file
and this will actually run in Python, but I created the df from dummy data instead of an Excel file.
# with dummy data
import pandas as pd

df = pd.DataFrame([0, 1, 2], index=['a', 'b', 'c']).T
col_dict = {'a': 'x', 'b': 'y', 'c': 'z'}
df = df.rename(columns=col_dict)
df.to_csv('new_file_name.csv')  # write new file
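To use the two sublists from the question directly (one of Excel names, one of CSV names), zip can build the rename dictionary on the fly. A sketch under assumptions (the function name, file names, and extra column are made up for illustration):

```python
import pandas as pd

def copy_columns(src_xlsx, dst_csv, src_cols, dst_cols):
    # src_cols and dst_cols are the two parallel sublists:
    # names as they appear in the Excel file, and the names
    # they should get in the CSV, in matching order.
    df = pd.read_excel(src_xlsx, usecols=src_cols)  # keep only wanted columns
    df = df.rename(columns=dict(zip(src_cols, dst_cols)))
    df.to_csv(dst_csv, index=False)
```

For example, copy_columns("scans.xlsx", "out.csv", ["scan_date"], ["date_of_scan"]) pulls just the scan_date column and writes it out as date_of_scan.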
I am using pandas to split a large csv into multiple csvs, each containing a single row.
I have a csv with 1 million records, and using the code below it takes too much time.
For example, in the above case there will be 1 million csvs created.
Anyone can help me how to decrease time in splitting csv.
for index, row in lead_data.iterrows():
    row.to_csv(row['lead_id'] + ".csv")
lead_data is the dataframe object.
Thanks
You don't need to loop through the data. Filter the records by lead_id and then export the data to a CSV file. That way you will be able to split the files based on the lead id (assuming that is the goal).
Example, split all EPL games where arsenal was at home:
data = pd.read_csv('footbal/epl-2017-GMTStandardTime.csv')
print("Selecting Arsenal")
ft = data.loc[data['HomeTeam'] == 'Arsenal']
print(ft.head())
# Export data to CSV
ft.to_csv('arsenal.csv')
print("Done!")
This way it is much faster than processing one record at a time.
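If you do need one file per lead_id rather than one filter at a time, groupby generalizes the same idea. A sketch, assuming lead_id is the split key (the function name is made up for illustration):

```python
import pandas as pd

def split_by_lead(lead_data):
    # groupby makes a single pass over the frame and hands back
    # one sub-frame per distinct lead_id, avoiding a per-row
    # iterrows() loop over a million records.
    for lead_id, group in lead_data.groupby("lead_id"):
        group.to_csv(f"{lead_id}.csv", index=False)
```

Each output file then contains every row sharing that lead_id, written with its header.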
I need some help with the following.
I currently use python pandas to open a massive spreadsheet every day (this spreadsheet is a report, hence every day the data inside the spreadsheet is different). Pandas dataframe allows me to quickly crunch the data and generate a final output table, with much less data than the initial excel file.
Now, on day 1, I would need to add this output dataframe (3 rows 10 columns) to a new excel sheet (let's say sheet 1).
On day 2, I would need to take the new output of the dataframe and append it to the existing sheet 1. So at the end of day 2, the table in sheet1 would have 6 rows and 10 columns.
On day 3, same thing: I will launch my Python pandas tool, read the data from the Excel report, generate a 3x10 output dataframe, and append it again to my Excel file.
I can't find a way to append to an existing excel table.
Could anybody help?
Many thanks in advance,
Andrea
If you use openpyxl's utilities for dataframes, then you should be able to do everything you need with the existing workbook, assuming it fits into memory.
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

wb = load_workbook(r"C:\Andrea\master_file.xlsx")  # raw string, so backslashes survive
ws = wb[SHEETNAME]
# Skip the dataframe's index and header so each day's append
# adds only the 3 data rows.
for row in dataframe_to_rows(dt_today, index=False, header=False):
    ws.append(row)
wb.save(r"C:\Andrea\master_file.xlsx")
I am attempting to merge CSV files together.
I have 28 countries that need testing; that data is in File-1.csv.
Every country has maybe 10 test scenarios; that data is in File-2.csv.
I need the merged file so I can use Python to create the N*M (280) unit cases.
File-1.csv
Case,Country,URL,Message1,Message2
01,UK,www.acuvue.co.uk, registersuccess,Fail
02,LU,www.acuvue.lu, denaissance,Vousdevez
03,DE,www.acuvue.de,,,
File-2.csv
Country,Scenario,Mail,Name,Password
UK,InvalidMail,TEST,Susan_UK,Password1#
UK,InvalidPass,susan#test.com,Susan_UK,TEST
LU,InvalidMail,TEST,Susan_LU,Password1#
DE,InvalidMail,TEST,Susan_DE,Password1#
I want Python merge those two CSV file as below:
Case,Country,URL,Message1,Message2,Scenario,Mail,Name,Password
010,UK,www.acuvue.co.uk,registersuccess,Fail,InvalidMail,TEST,Susan_UK,Password1#
011,UK,www.acuvue.co.uk,registersuccess,Fail,InvalidPass,susan#test.com,Susan_UK,TEST
020,LU,www.acuvue.lu,denaissance,Vousdevez,InvalidMail,TEST,Susan_LU,Password1#
030,DE,www.acuvue.de,,,InvalidMail,TEST,Susan_DE,Password1#
How could I do this in Python?
Try reading both CSVs into two separate lists, use itertools.zip_longest to merge the two lists, and store the merged result in a single list.
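Since one File-1 country row can match several File-2 scenario rows (UK appears twice in File-2), the desired output is really a join on the shared Country column rather than a positional zip. As an alternative sketch, pandas merge expresses that directly (file names taken from the question; the function name is made up):

```python
import pandas as pd

def merge_cases(f1, f2, out):
    left = pd.read_csv(f1)
    right = pd.read_csv(f2)
    # Join on Country: every File-1 row fans out to one output row
    # per matching File-2 scenario, giving the N*M combinations.
    merged = left.merge(right, on="Country", how="left")
    merged.to_csv(out, index=False)
```

Calling merge_cases("File-1.csv", "File-2.csv", "merged.csv") yields one row per (country, scenario) pair; the composite case numbers (010, 011, ...) would still need to be generated from the Case column afterwards.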