I want to read an Excel file with Pandas, delete the header row and the first column, and write the resulting data to an Excel file with the same name. I want to do this for all the Excel files in a folder. I have written the code for reading and writing the data, but I am having trouble saving the data to a file with the same name. My code so far:
import numpy as np
import pandas as pd
import os
for filename in os.listdir('./'):
    if filename.endswith('.xlsx'):
        df = pd.read_excel('new.xlsx', skiprows=1)
        df.drop(df.columns[0], axis=1, inplace=True)
        df.to_csv('new.csv', index=False)
How can I automate my code for all the Excel files in the folder instead of just new.xlsx?
Use the variable filename in read_excel, build the new file names with str.format, and remove the first column with DataFrame.iloc by selecting every column except the first:
for filename in os.listdir('./'):
    if filename.endswith('.xlsx'):
        df = pd.read_excel(filename, skiprows=1)
        df.iloc[:, 1:].to_csv('new_{}.csv'.format(filename), index=False)
Another solution uses glob, where you can specify the extension in the pattern. Note that glob returns paths like './file.xlsx', so take the basename before building the output name:
import glob
import os

for filename in glob.glob('./*.xlsx'):
    df = pd.read_excel(filename, skiprows=1)
    name = os.path.basename(filename)
    df.iloc[:, 1:].to_csv('new_{}.csv'.format(name), index=False)

# Python 3.6+
# df.iloc[:, 1:].to_csv(f'new_{name}.csv', index=False)
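The question asks for output under the same name as the input workbook. If you want Excel output rather than CSV, a minimal sketch (assuming openpyxl is installed, which pandas uses to write .xlsx) could overwrite each workbook in place:
import glob
import pandas as pd

for filename in glob.glob('./*.xlsx'):
    # header=None keeps everything after the skipped first row as data
    df = pd.read_excel(filename, header=None, skiprows=1)
    # overwrite the original workbook without its first column
    df.iloc[:, 1:].to_excel(filename, index=False, header=False)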
Try the following for reading multiple files into one dataframe:
import pandas as pd
import glob
# Read multiple files into one dataframe with pandas `concat`.
# If your files live under `/home/data/`, use the pattern `/home/data/*.xlsx`;
# otherwise point the pattern at your own directory.
# Note: read_excel takes no `sep` or `index` arguments; those belong to read_csv.
df = pd.concat([pd.read_excel(f, skiprows=1) for f in glob.glob("/home/data/*.xlsx")])
Alternative:
# Read multiple files into one dataframe
all_files = glob.glob('/home/data/*.xlsx')
df = pd.concat(pd.read_excel(f, skiprows=1) for f in all_files)
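The snippets above only build the combined dataframe in memory; to also put it on disk, a one-line follow-up (the output name combined.xlsx is just a placeholder) would be:
# write the combined dataframe from above to disk (requires openpyxl for .xlsx)
df.to_excel('combined.xlsx', index=False)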
I have many CSV files under subdirectories in one folder. They all contain tweets and other metadata. I am interested in removing most of this metadata and keeping only the tweets themselves and their time. I used glob to read the files, and the removal part seems to be working fine. However, I am not sure how to save the output so that all files are saved with their original file names.
import pandas as pd
import glob
path = r'D:\tweets'
myfiles = glob.glob(r'D:\tweets\**\*.csv', recursive=True)
for f in myfiles:
    df = pd.read_csv(f)
    df = df.drop(["name", "id", "conversation_id", "created_at", "date"], axis=1)
    df = df[df["language"].str.contains("bn|ca|ckbu|id|zh") == False]
    df.to_csv("output_filename.csv", index=False, encoding='utf8')
If you write back to f, each original file is overwritten in place, which keeps its original name:
for f in myfiles:
    df = pd.read_csv(f)
    df = df.drop(["name", "id", "conversation_id", "created_at", "date"], axis=1)
    df = df[df["language"].str.contains("bn|ca|ckbu|id|zh") == False]
    df.to_csv(f, index=False, encoding='utf8')
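If you would rather keep the originals untouched, a variant can write each cleaned file under its original name into a separate output directory (the folder name D:\tweets_cleaned below is just an assumption):
import os

out_dir = r'D:\tweets_cleaned'  # hypothetical output folder
os.makedirs(out_dir, exist_ok=True)
for f in myfiles:
    df = pd.read_csv(f)
    df = df.drop(["name", "id", "conversation_id", "created_at", "date"], axis=1)
    df = df[~df["language"].str.contains("bn|ca|ckbu|id|zh")]
    # reuse the original file name, but in the output folder
    df.to_csv(os.path.join(out_dir, os.path.basename(f)), index=False, encoding='utf8')
Note that files from different subdirectories that share a base name would collide here; include the subdirectory in the output name if that matters.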
I have been trying to build a script in Python that pulls the info from a set of CSV files. The CSV format is as follows and has no header: ['Day','Hour','Seconds','Microsecods','x_accel','y_accel']. Instead of putting the values into the corresponding columns, pandas is reading each row as a single string like "9,40,19,65664,-0.527,-0.333" in the first column. I tried using dtype and sep=',' but neither worked. I don't understand why it does not fit the values into the right columns.
This is my script:
import numpy as np
import os
import pandas as pd
os.chdir('C:/Users/pc/Desktop/41x/Learning_set/Bearing1_1')
path = os.getcwd()
files = os.listdir(path)
df = pd.DataFrame()
columns = ['Day','Hour','Seconds','Microsecods','x_accel','y_accel']
for f in files:
    data = pd.read_csv(f, 'Sheet1', header=None, engine='python', names=columns)
    df = df.append(data)
print(df)
You're calling the read_csv function, but the second positional argument is sep, so passing 'Sheet1' tells pandas to use the literal string 'Sheet1' as the separator. That string never appears in your data, so each whole line ends up as one value in a single column:
pd.read_csv(f, 'Sheet1', header=None, engine='python', names=columns)
Is it a CSV, or is it from an Excel file? If it is a CSV, you can most likely just remove that argument and it will work as expected.
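For instance, a corrected call (keeping the rest of the original arguments, and letting sep default to a comma) might look like this:
data = pd.read_csv(f, header=None, engine='python', names=columns)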
I wrote code to load txt files and save them as CSV using pandas in Python.
import glob
filenames = sorted(glob.glob("D:/a/test*.txt"))
filenames = filenames[0:5]
import numpy as np
import pandas as pd
for f in filenames:
    df = pd.read_csv(f, skiprows=[1, 2, 3], dtype=str, delim_whitespace=True)
    df.to_csv(f'{f[:-4]}.csv', index=False)
This produces 10 files in the folder:
test1.txt, test2.txt, test3.txt, test4.txt, test5.txt,
test1.csv, test2.csv, test3.csv, test4.csv, test5.csv
# Resulting csv file (test1.csv):
abc:,1.233e-03
1.234e-04,
1.235e-02,
1.236e-05,
1.237e-02,
1.238e-02,
But I have two problems.
I don't know how to rename test1.txt, test2.txt, test3.txt, test4.txt, test5.txt to c1.csv, c2.csv, c3.csv, c4.csv, c5.csv.
I want to remove the 'abc:,' data in all the test(1,2,3,4,5).csv files, but I don't know how to delete and replace it.
How can I rename the files and remove those specific characters in Python?
Original data
test1.txt (the other files {test2, test3, test4, test5}.txt are similar):
abc: 1.233e-03
def: 1.64305155216978164e+02
ghi: 4831
jkl:
1.234e-04
1.235e-02
1.236e-05
1.237e-02
1.238e-02
Expected result
(test1,2,3,4,5.txt must be changed into c1,2,3,4,5.csv; the header name (abc:) is removed, and the def:, ghi:, jkl: rows are removed as well.)
1.233e-03
1.234e-04
1.235e-02
1.236e-05
1.237e-02
1.238e-02
You can rename a file with the os.rename() method, or create a new file with f = open("file_name.extension", "w+") and write the output to it.
Once you load your data into a string variable, you can strip unwanted text with the replace() method (see the string-based sketch below).
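As an illustration, a minimal string-based sketch (assuming the files sit in the current directory and follow the exact layout shown above) could keep the value after 'abc:' and drop the def:/ghi:/jkl: lines:
import glob

for f in sorted(glob.glob("test*.txt")):
    with open(f) as fh:
        lines = [ln.strip() for ln in fh if ln.strip()]
    # keep the value after "abc:" and the bare numeric lines;
    # the def:/ghi:/jkl: lines all contain a colon and are dropped
    values = [lines[0].replace("abc:", "").strip()]
    values += [ln for ln in lines[1:] if ":" not in ln]
    # test1.txt -> c1.csv
    with open("c" + f[len("test"):-len(".txt")] + ".csv", "w") as out:
        out.write("\n".join(values) + "\n")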
Here is what you can do:
import os
import glob
import pandas as pd

for f in sorted(glob.glob("test*.txt")):
    # header=None keeps the "abc: 1.233e-03" line as a data row;
    # skiprows=[1, 2, 3] drops the def:/ghi:/jkl: rows
    df = pd.read_csv(f, skiprows=[1, 2, 3], header=None, dtype=str, delim_whitespace=True)
    # the first row holds ["abc:", "1.233e-03"]: keep the value, then the remaining numbers
    values = pd.Series([df.iloc[0, 1]] + df.iloc[1:, 0].tolist())
    # test1.txt -> c1.csv
    values.to_csv("c" + f[len("test"):-len(".txt")] + ".csv", index=False, header=False)
    os.remove(f)  # drop the original txt once the csv is written
I need to extract a worksheet from multiple Excel workbooks, combine them into a dataframe, and then save that dataframe.
I have a spreadsheet that is generated at the end of each month (e.g.
June 2019.xlsx, May 2019.xlsx, April 2019.xlsx).
I need to grab the worksheet 'Sheet1' from each of these workbooks and convert them to a dataframe (df1).
I would like to have this dataframe saved.
As a nice to have, I would also like some way just to append the next month's data after the initial 'data grab'.
I'm relatively new to this, so I haven't made much progress.
import os
import glob
import pandas as pd
import xlrd
import json
import io
import flatten_json
files = glob.glob('/Users/ngove/Documents/Python Scripts/2019/*.xlsx')
dfs={}
for f in files:
    dfs[os.path.splitext(os.path.basename(f))[0]] = pd.read_excel(f)
Put all of your files in one directory (e.g. the current directory), then collect the Excel file names in a list (e.g. files_xls). Iterate over them and use pandas.read_excel to get the respective dataframes (collected in list_frames).
Below, you can find an example:
import os
import pandas as pd

path = os.getcwd()        # current directory
files = os.listdir(path)  # all the files in the current directory

# keep only the xls or xlsm files (this depends on your data)
files_xls = [f for f in files if f.endswith('.xls') or f.endswith('.xlsm')]

list_frames = []
for f in files_xls:
    print("Processing file: %s" % f)
    try:
        # read_excel returns the dataframe; note that csv-only parameters
        # such as sep, error_bad_lines or skip_blank_lines do not apply here
        data = pd.read_excel(f, 'Sheet1', header=0, index_col=None)
    except Exception as e:
        print("Skipping %s: %s" % (f, e))
        continue
    list_frames.append(data)

# at the end you can concat your data and remove any duplicates
df = pd.concat(list_frames, sort=False).fillna(0)
df = df.drop_duplicates()

# finally, save it
with pd.ExcelWriter("your_title.xlsx", engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name="Sheet1", index=False)
I hope this helps.
I interpreted your statement that you want to save the dataframe as meaning you want to save it as a combined Excel file. This will combine all files in the specified folder that end in xlsx.
import os
import pandas as pd
from pandas import ExcelWriter

os.chdir("H:/Python/Reports/")  # edit this to be your path
path = os.getcwd()
files = os.listdir(path)
files_xlsx = [f for f in files if f[-4:] == 'xlsx']

# collect each sheet, then concatenate once (DataFrame.append is deprecated)
frames = [pd.read_excel(f, 'Sheet1') for f in files_xlsx]
df = pd.concat(frames)

with ExcelWriter('Combined_Data.xlsx') as writer:
    df.to_excel(writer, 'Sheet1', index=False)
You could update the code to grab all 2019 files by changing the one line to this:
files_xlsx = [f for f in files if f[-9:] == '2019.xlsx']
I referenced this question for most of the code, updated it for xlsx, and added the file-save portion.
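For the "nice to have" of appending the next month's data after the initial grab, one simple sketch (the file name 'July 2019.xlsx' is only an example) is to read the existing combined workbook, concatenate the new month, and rewrite it:
import pandas as pd

existing = pd.read_excel('Combined_Data.xlsx', 'Sheet1')
new_month = pd.read_excel('July 2019.xlsx', 'Sheet1')  # example file name
combined = pd.concat([existing, new_month])

with pd.ExcelWriter('Combined_Data.xlsx') as writer:
    combined.to_excel(writer, 'Sheet1', index=False)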
I am working with lots of CSV files and need to add a column to each. I used glob, for example:
import glob
filenames = sorted(glob.glob('./DATA1/*2018*.csv'))
filenames = filenames[0:10]
import numpy as np
import pandas as pd
for f in filenames:
    df = pd.read_csv(f, header=None, index_col=None)
    df.columns = ['Date', 'Signal', 'Data', 'Code']
    # this is what I should add to all csv files
    df["ID"] = df["Data"].str.slice(0, 2)
I need a way to save each file back to CSV (not concatenated) under a different name, such as "file01edited.csv", after adding the column.
Use to_csv with f-strings to change the file names:
for f in filenames:
    df = pd.read_csv(f, names=['Date', 'Signal', 'Data', 'Code'], index_col=None)
    # this is what I should add to all csv files
    df["ID"] = df["Data"].str.slice(0, 2)
    # Python 3.6+
    df.to_csv(f'{f[:-4]}edited.csv', index=False)
    # Python below 3.6
    # df.to_csv('{}edited.csv'.format(f[:-4]), index=False)
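If you prefer not to slice the extension off manually, an equivalent pathlib-based variant keeps the name handling explicit:
from pathlib import Path

for f in filenames:
    p = Path(f)
    df = pd.read_csv(f, names=['Date', 'Signal', 'Data', 'Code'], index_col=None)
    df["ID"] = df["Data"].str.slice(0, 2)
    # ./DATA1/foo.csv -> ./DATA1/fooedited.csv
    df.to_csv(p.with_name(f'{p.stem}edited{p.suffix}'), index=False)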