Python to combine Excel spreadsheets

Hello all… a question on using pandas to combine Excel spreadsheets.
The problem is that the sequence of columns is lost when the files are combined; with more files to combine, the formatting gets even worse.
It also gives an error message when the number of files is large:
ValueError: column index (256) not an int in range(256)
What I am using is below:
import os
import pandas as pd

df = pd.DataFrame()
for f in ['c:\\1635.xls', 'c:\\1644.xls']:
    data = pd.read_excel(f, 'Sheet1')
    data.index = [os.path.basename(f)] * len(data)  # label each row with its source file
    df = df.append(data)
df.to_excel('c:\\CB.xls')
The original files and the combined output look like this (screenshots omitted):
What's the best way to combine a large number of similar Excel files?
Thanks.

I usually use xlrd and xlwt:
#!/usr/bin/env python
# encoding: utf-8
import os
import xlrd
import xlwt

current_file = xlwt.Workbook()
write_table = current_file.add_sheet('sheet1', cell_overwrite_ok=True)

# write the combined header row once
key_list = [u'City', u'Country', u'Received Date', u'Shipping Date', u'Weight', u'1635']
for title_index, text in enumerate(key_list):
    write_table.write(0, title_index, text)

file_list = ['1635.xlsx', '1644.xlsx']
i = 1  # next free row in the output sheet
for name in file_list:
    data = xlrd.open_workbook(name)
    table = data.sheets()[0]
    nrows = table.nrows
    for row in range(nrows):
        if row == 0:
            continue  # skip each file's own header row
        for index, context in enumerate(table.row_values(row)):
            write_table.write(i, index, context)
        i += 1

current_file.save(os.path.join(os.getcwd(), 'result.xls'))
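Note that xlrd 2.0+ dropped .xlsx support, so the snippet above only runs as-is with an older xlrd. For .xlsx input, a rough openpyxl equivalent might look like this (a sketch, reusing the same file names as above):

from openpyxl import Workbook, load_workbook

out_wb = Workbook()
out_ws = out_wb.active
# write the combined header row once
out_ws.append(['City', 'Country', 'Received Date', 'Shipping Date', 'Weight', '1635'])

for name in ['1635.xlsx', '1644.xlsx']:
    ws = load_workbook(name).worksheets[0]
    for row in ws.iter_rows(min_row=2, values_only=True):  # min_row=2 skips each file's header
        out_ws.append(list(row))

out_wb.save('result.xlsx')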

Instead of data.index = [os.path.basename(f)] * len(data) you should use df.reset_index().
For example:
1.xlsx:
a b
1 1
2 2
3 3
2.xlsx:
a b
4 4
5 5
6 6
code:
df = pd.DataFrame()
for f in [r"C:\Users\Adi\Desktop\1.xlsx", r"C:\Users\Adi\Desktop\2.xlsx"]:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)
df.reset_index(inplace=True, drop=True)
df.to_excel('c:\\CB.xls')
cb.xls:
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
If you don't want the dataframe's index to be in the output file, you can use df.to_excel('c:\\CB.xls', index=False).
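Note also that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On current pandas the same merge can be written with pd.concat; a minimal sketch, reusing the example paths above:

import pandas as pd

files = [r"C:\Users\Adi\Desktop\1.xlsx", r"C:\Users\Adi\Desktop\2.xlsx"]
frames = [pd.read_excel(f, sheet_name='Sheet1') for f in files]  # one dataframe per workbook
df = pd.concat(frames, ignore_index=True)  # stack the rows and renumber the index
df.to_excel('c:\\CB.xls', index=False)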

Related

Pandas Concat vs append and join columns --> ("state", "state:", "State")

I join 437 tables and I get 3 columns for state, as my coworkers feel like giving it a different name each day ("state", "state:" and "State"). Is there a way to join those 3 columns into just 1 column called "state"?
*Also, my code uses append, and I just saw it's deprecated. Will it work the same using concat? Is there any way to make it give the same results as append?
I tried:
excl_merged.rename(columns={"state:": "state", "State": "state"})
but it doesn't do anything.
The code I use:
# importing the required modules
import glob
import pandas as pd

# specifying the path to the Excel files
path = "X:/.../Admission_merge"

# Excel files in the path
file_list = glob.glob(path + "/*.xlsx")

# list of Excel files we want to merge;
# pd.read_excel(file_path) reads the Excel
# data into a pandas dataframe.
excl_list = []
for file in file_list:
    excl_list.append(pd.read_excel(file))  # would .concat give the columns in the same order?

# create a new dataframe to store the
# merged Excel data.
excl_merged = pd.DataFrame()
for excl_file in excl_list:
    # appends the data to the excl_merged dataframe
    excl_merged = excl_merged.append(excl_file, ignore_index=True)

# export the dataframe to an Excel file with
# the specified name.
excl_merged.to_excel('X:/.../Admission_MERGED/total_admission_2021-2023.xlsx', index=False)
print("Merge finished")
Any suggestions on how I can improve it? Also, is there a way to remove unnamed empty columns?
Thanks a lot.
You can use pd.concat:
excl_list = ['state1.xlsx', 'state2.xlsx', 'state3.xlsx']
state_map = {'state:': 'state', 'State': 'state'}

data = []
for excl_file in excl_list:
    df = pd.read_excel(excl_file)
    # case where the first row is empty
    if df.columns[0].startswith('Unnamed'):
        df.columns = df.iloc[0]
        df = df.iloc[1:]
    df = df.rename(columns=state_map)
    data.append(df)
excl_merged = pd.concat(data, ignore_index=True)
# Output
ID state
0 A a
1 B b
2 C c
3 D d
4 E e
5 F f
6 G g
7 H h
8 I i
file1.xlsx:
ID State
0 A a
1 B b
2 C c
file2.xlsx:
ID state
0 D d
1 E e
2 F f
file3.xlsx:
ID state:
0 G g
1 H h
2 I i
If you have empty columns, you can use data.append(df.dropna(how='all', axis=1)) instead of data.append(df) to drop them before they are concatenated.
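(Incidentally, the rename attempt in the question does work; rename returns a new dataframe, so its result has to be assigned back or called with inplace=True.) A minimal sketch of the dropna variant, reusing excl_list and state_map from above and omitting the 'Unnamed' first-row handling for brevity:

data = []
for excl_file in excl_list:
    df = pd.read_excel(excl_file).rename(columns=state_map)
    data.append(df.dropna(how='all', axis=1))  # drop columns that are entirely empty
excl_merged = pd.concat(data, ignore_index=True)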

Select cells from an Excel table in Python

I have an Excel table that contains:

ID product   03/1/2021   16/1/2022   12/2/2022   14/3/2023
A            4           1           2           5
B            6           1           3
C                        7           6
In the same sheet I have a drop-down list that contains the year and the month.
If I select, for example, year = 2021 and month = 1 in the drop-down list, it should return something like this:

ID product   03/1/2021
A            4
B            6
C

and then it should calculate the sum of those cells: sum = 10 in this case.
Here is my code:
# import load_workbook
import pandas as pd
import numpy as np
from openpyxl import load_workbook
from openpyxl.worksheet.datavalidation import DataValidation
from openpyxl import Workbook
from openpyxl.styles import PatternFill

# set file path
filepath = r'test.xlsx'
wb = load_workbook(filepath)
ws = wb["sheet1"]

# generate the years 2021-2029 in column MK
for number in range(1, 10):
    ws['MK{}'.format(number)].value = "202{}".format(number)
data_val = DataValidation(type="list", formula1='=MK1:MK9')
ws.add_data_validation(data_val)
# drop-down list in E2 with all the values from column MK
data_val.add(ws["E2"])

# generate the month numbers 1-12 in column MN
for numbers in range(1, 13):
    ws['MN{}'.format(numbers)].value = "{}".format(numbers)
data_vals = DataValidation(type="list", formula1='=MN1:MN12')
ws.add_data_validation(data_vals)
# drop-down list in E3 with all the values from column MN
data_vals.add(ws["E3"])

# highlight the 'year' and 'month' cells in yellow
ws['E2'].fill = PatternFill(start_color='FFFFFF00', end_color='FFFFFF00', fill_type='solid')
ws['E3'].fill = PatternFill(start_color='FFFFFF00', end_color='FFFFFF00', fill_type='solid')

# save workbook
wb.save(filepath)
Any suggestions?
Thank you for your help.
Assuming your Excel file has the year in cell A1 and the month in cell A2, with the table starting two rows below (screenshot omitted), the final code looks like this:
import pandas as pd
import xlrd

file = r'C:\path\test_exl.xlsx'
sheetname = 'Sheet1'
n = 2  # number of rows above the table

df = pd.read_excel(file, skiprows=[*range(n)], index_col=[0])

workbook = xlrd.open_workbook(file)
worksheet = workbook.sheet_by_name(sheetname)
year = worksheet.cell(0, 0).value
month = worksheet.cell(1, 0).value

datetime_cols = pd.to_datetime(df.columns, dayfirst=True, errors='coerce')
out = (df.loc[:, (datetime_cols.year == year) & (datetime_cols.month == month)]
         .reset_index())
print(out)
  ID Product  03-01-2021
0          A         4.0
1          B         6.0
2          C         NaN
Breakdown:
You can first read the table in pandas using pd.read_excel:
file = r'C:\path\test_exl.xlsx'
sheetname = 'Sheet1'
n = 2  # change n to how many lines must be skipped to reach the table;
       # here the table starts at line 3, so n = 2
df = pd.read_excel(file, skiprows=[*range(n)], index_col=[0])
Then access the cell values using xlrd:
import xlrd

workbook = xlrd.open_workbook(file)
worksheet = workbook.sheet_by_name(sheetname)
year = worksheet.cell(0, 0).value   # A1 is (0, 0)
month = worksheet.cell(1, 0).value  # A2 is (1, 0), and so on
# print(year, month) gives 2021 and 1
Then convert the columns to datetime and filter:
datetime_cols= pd.to_datetime(df.columns,dayfirst=True,errors='coerce')
out = (df.loc[:,(datetime_cols.year == year) & (datetime_cols.month == month)]
.reset_index())
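The question also asks for the sum of the returned cells (10 in the example). One possible follow-up on the out dataframe above; note that sum skips NaN by default:

# add up every numeric cell in the filtered result (4.0 + 6.0 here)
total = out.select_dtypes(include='number').sum().sum()
print(total)  # 10.0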

Two types of headers: txt to pandas dataframe

Let's say I have a .txt file like that:
#D=H|ID|STRINGIDENTIFIER
#D=T|SEQ|DATETIME|VALUE
H|879|IDENTIFIER1
T|1|1569972384|7
T|2|1569901951|9
T|3|1569801600|8
H|892|IDENTIFIER2
T|1|1569972300|109
T|2|1569907921|101
T|3|1569803600|151
And I need to create a dataframe like this:
IDENTIFIER SEQ DATETIME VALUE
879_IDENTIFIER1 1 1569972384 7
879_IDENTIFIER1 2 1569901951 9
879_IDENTIFIER1 3 1569801600 8
892_IDENTIFIER2 1 1569972300 109
892_IDENTIFIER2 2 1569907921 101
892_IDENTIFIER2 3 1569803600 151
What would be the possible code?
A basic way to do it might just be to process the text file and convert it into a csv before using the read_csv function in pandas, assuming the file you want to process is as consistent as the example:
import pandas as pd

with open('text.txt', 'r') as file:
    fileAsRows = file.read().split('\n')

pdInput = 'IDENTIFIER,SEQ,DATETIME,VALUE\n'  # add header
for row in fileAsRows:
    cols = row.split('|')  # break up the row
    if row.startswith('H'):  # get identifier info from an H row
        Identifier = cols[1] + '_' + cols[2]
    if row.startswith('T'):  # get the other fields from a T row
        Seq = cols[1]
        DateTime = cols[2]
        Value = cols[3]
        tempList = [Identifier, Seq, DateTime, Value]
        pdInput += (','.join(tempList) + '\n')

with open("pdInput.csv", "a") as file:
    file.write(pdInput)

# import into pandas
df = pd.read_csv("pdInput.csv")
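A variant that builds the dataframe directly and skips the intermediate csv file; a sketch under the same assumption that every T row belongs to the most recent H row:

import pandas as pd

rows = []
identifier = None
with open('text.txt') as fh:
    for line in fh:
        cols = line.strip().split('|')
        if cols[0] == 'H':    # header row: remember the identifier for the rows below
            identifier = cols[1] + '_' + cols[2]
        elif cols[0] == 'T':  # data row: emit one record tagged with that identifier
            rows.append({'IDENTIFIER': identifier,
                         'SEQ': int(cols[1]),
                         'DATETIME': int(cols[2]),
                         'VALUE': int(cols[3])})

df = pd.DataFrame(rows)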

Python for loop enumerate

I am reading multiple csv files and combining them into one csv file. The desired outcome of the combined data looks like the following:
0 4 6 8 10 12
1 2 5 4 2 1
5 3 0 1 5 10
....
In the following code, I intend the columns to go 0, 4, 6, 8, 10, 12.
for indx, file in enumerate(files_File1):
    if file.endswith('csv'):  # only process csv files in the designated folder
        filepath = os.path.join(folder_File1, file)
        current = pd.read_csv(filepath, header=None)
        if indx == 0:
            mydata_File1 = current.copy()
            mydata_File1.columns.values[1] = 4
            print(mydata_File1.columns.values)
        else:
            mydata_File1[2*indx+4] = current.iloc[:, 1]
            print(mydata_File1.columns.values)
But instead, the outcome looks like this, where the columns go 0, 4, 2, 6, 8, 10, 12.
0 4 2 6 8 10 12
1 2 5 4 2 1
5 3 0 1 5 10
....
I am not quite sure what causes the column named "2".
Any idea?
If there is some reason you need pandas, then this will work. Your code references mydata_File1.columns.values, which holds the names of the columns, not the values in the columns. If this doesn't answer your question, then please provide a more complete example per @juanpa.arrivillaga's comment.
#! python3
import os
import pandas as pd
import glob

folder_File1 = r"C:\Users\Public\Documents\Python\CombineCSVFiles"
csv_only = r"\*.csv"
files_File1 = glob.glob(f'{folder_File1}{csv_only}')
new_csv = f'{folder_File1}\\newcsv.csv'

mydata_File1 = []
for indx, file in enumerate(files_File1):
    if file == new_csv:
        pass  # skip the output file if it is already in the folder
    else:
        current = pd.read_csv(file, header=None)
        print(current)
        if indx == 0:
            mydata_File1 = current.copy()
            print(mydata_File1.values)
        else:
            mydata_File1 = mydata_File1.append(current, ignore_index=True)
            print(mydata_File1.values)
mydata_File1.to_csv(new_csv)
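If the goal is the side-by-side layout from the question (columns 0, 4, 6, 8, ...), the renaming is the likely culprit: assigning into mydata_File1.columns.values mutates the index's backing array instead of relabelling the column. A sketch of one way to do it, assuming every csv has exactly two columns and the first column is identical across files:

import glob
import pandas as pd

folder = r"C:\Users\Public\Documents\Python\CombineCSVFiles"  # same example folder as above
files = [f for f in glob.glob(folder + r"\*.csv") if not f.endswith("newcsv.csv")]

combined = None
for indx, f in enumerate(files):
    current = pd.read_csv(f, header=None)
    if indx == 0:
        combined = current.copy()
        combined.columns = [0, 4]  # relabel by assigning a whole new column index
    else:
        combined[2 * indx + 4] = current.iloc[:, 1]  # adds columns 6, 8, 10, ...
combined.to_csv(folder + r"\newcsv.csv", index=False)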
If you are really just trying to combine .csv files, there's no need for pandas.
#! python3
import glob

folder_File1 = r"C:\Users\Public\Documents\Python\CombineCSVFiles"
csv_only = r"\*.csv"
files_File1 = glob.glob(f'{folder_File1}{csv_only}')
new_csv = f'{folder_File1}\\newcsv.csv'

lines = []
for file in files_File1:
    with open(file) as filein:
        if filein.name == new_csv:
            pass  # don't re-read the output file
        else:
            for line in filein:
                line = line.strip()  # or some other preprocessing
                lines.append(line)  # storing everything in memory!

with open(new_csv, 'w') as out_file:
    out_file.writelines(line + u'\n' for line in lines)

Efficient way to get the row number in an ordered CSV file that is higher than a specific unix timestamp

I have a very big CSV file in the format where the first column is a unix timestamp already sorted from lowest to highest.
1461568570,2977.320000000000,0.032000000000
1461568570,2977.320000000000,0.076000000000
1461568570,2977.320000000000,0.076000000000
1461568569,2977.050000000000,0.050000000000
1461568569,2977.050000000000,0.050000000000
1461568569,2977.300000000000,0.021900000000
1461568569,2977.310000000000,0.021900000000
1461568569,2977.320000000000,0.050000000000
1461568423,2978.510000000000,0.500000000000
1461568421,2977.920000000000,0.023300000000
1461568421,2977.920000000000,0.010900000000
1461568421,2977.910000000000,0.165800000000
And I want to import the data into a pandas dataframe, but I want to restrict it to a subset of the data.
Now, pandas read_csv has the skiprows and skipfooter options, where I can tell it to retrieve data only after a certain row in the CSV file. But I want to specify the row number to start reading from so as to only catch the rows after a certain unix timestamp (so basically the line number of the first line whose unix timestamp is equal to or higher than 1461568423, for example).
What would be an efficient way to do this?
IIUC then you can do something like the following:
In [47]:
# assumes `import io` and that `t` holds the sample CSV from the question as a string
line = 0
chunksz = 3
for chunk in pd.read_csv(io.StringIO(t), header=None, names=['timestamp', 'val1', 'val2'], chunksize=chunksz):
    if len(chunk[chunk['timestamp'] == 1461568423]) > 0:
        # offset of the match within this chunk plus the rows already skipped
        line += chunk[chunk['timestamp'] == 1461568423].index[0] - chunk.index[0]
        break
    else:
        line += chunksz
pd.read_csv(io.StringIO(t), header=None, names=['timestamp', 'val1', 'val2'], skiprows=line)
Out[47]:
    timestamp     val1    val2
0  1461568423  2978.51  0.5000
1  1461568421  2977.92  0.0233
2  1461568421  2977.92  0.0109
3  1461568421  2977.91  0.1658
Here we set a line counter to 0 and a nominal chunksz; we iterate over the chunks until we find a match and then use the accumulated line count as the param value for skiprows. This should be fast, as we can set a large chunksize and keep skipping chunks where the row isn't found.
I think you can use preprocessing with get_row, which returns the number of the first row containing the timestamp; that number is then used for the skiprows parameter in read_csv:
import pandas as pd
import csv

# preprocessing
def get_row(data):
    with open('test.csv', 'r') as csvfile:
        reader = csv.reader(csvfile)
        for i, row in enumerate(reader):
            if row[0] == data:
                return i

print(get_row('1461568423'))
8

df = pd.read_csv('test.csv', skiprows=get_row('1461568423'), header=None, names=['a', 'b', 'c'])
print(df)
            a        b       c
0  1461568423  2978.51  0.5000
1  1461568421  2977.92  0.0233
2  1461568421  2977.92  0.0109
3  1461568421  2977.91  0.1658
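Because the timestamps are ordered, get_row can also stop early instead of scanning the whole file when the exact timestamp never occurs; a sketch, assuming the descending order of the sample data and that reading should start at the first row at or below the target:

import csv

def get_first_row_at_or_below(target):
    # timestamps decrease down the file, so the first row whose
    # timestamp is <= target is where reading should start
    with open('test.csv', 'r') as csvfile:
        for i, row in enumerate(csv.reader(csvfile)):
            if int(row[0]) <= target:
                return i
    return None  # every row is above the target

print(get_first_row_at_or_below(1461568423))  # 8 for the sample data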
Note: in the example you've provided, timestamps are ordered from the highest to the lowest.
Considering that you have a csv file like:
timestamp
15
14
13
...
2
1
You can read it in chunks (pd.read_csv has such an option):
import pandas as pd

LIMIT_TIMESTAMP = 5

df_reader = pd.read_csv('data.csv', iterator=True, chunksize=3)
df = pd.DataFrame()
for chunk in df_reader:
    if chunk['timestamp'].min() < LIMIT_TIMESTAMP:
        chunk = chunk[chunk['timestamp'] > LIMIT_TIMESTAMP]
        df = pd.concat([df, chunk])
        break
    df = pd.concat([df, chunk])
df = df.reset_index(drop=True)
Results in:
timestamp
0 15
1 14
2 13
3 12
4 11
5 10
6 9
7 8
8 7
9 6
You do not have to read the file twice. Just read it in chunks of a sensible size and stop reading once you've reached your timestamp, then filter the obsolete rows from the last chunk.
