How do I replace cell values in pandas dataframe with the filename? - python

I'm appending several CSV files into one data frame, to then export into a combined CSV file. However, I need to replace the value "DTL" in the first column of each file with the filename so that the resulting data can still be tied back to each file.
Here's an example of my code:
import pandas as pd
import glob
import os
all_files = glob.glob("*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=None, sep='\t')
    dfproper = df.drop(0, 0)
    dfproper.replace(to_replace="DTL", value=os.path.basename(filename), inplace=True)
    li.append(dfproper)
df_final = pd.concat(li, axis=0, ignore_index=True)
df_final.to_csv("Combined.csv", index=None, header=None, sep='\t')
The code doesn't give me any errors in this form but also doesn't replace the "DTL" values.

Figured out the solution to this.
The separator I was using was '\t' (tab), but as these are CSVs they are literally comma-separated. Omitting the "sep" parameter let the "DTL" cell values match the "to_replace" parameter.
I also cleaned the code up a bit by skipping the header row in the pd.read_csv step instead of dropping it afterwards.
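To see why the tab separator prevented the match, here is a minimal sketch with in-memory data (the file contents below are made up for illustration):

```python
import io
import pandas as pd

csv_text = "DTL,1,2\nDTL,3,4\n"  # hypothetical comma-separated content

# With sep='\t', each whole line lands in one cell, so the cell value is
# "DTL,1,2" and replace("DTL", ...) finds no exact match.
wrong = pd.read_csv(io.StringIO(csv_text), header=None, sep='\t')
unchanged = wrong.replace("DTL", "file.csv")

# With the default comma separator, "DTL" is a whole cell and matches.
right = pd.read_csv(io.StringIO(csv_text), header=None)
replaced = right.replace("DTL", "file.csv")
```

Here `unchanged` still equals `wrong`, while `replaced` has "file.csv" throughout its first column, because DataFrame.replace matches against the full cell value, not substrings.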
Here's an example of the finished product.
import pandas as pd
import glob
import os
all_files = glob.glob("*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=None, skiprows=1, low_memory=False)
    df.replace("DTL", os.path.basename(filename), inplace=True)
    li.append(df)
df_final = pd.concat(li, axis=0, ignore_index=True)
df_final.to_csv("DHLE_20211114.csv", index=None, header=None)
df_final.head()
Thanks to everyone who commented!

Related

Pandas - import CSV files in folder, change column name if it contains a string, concat into one dataframe

I have a folder with about 40 CSV files containing data by month. I want to combine them all, but one column in these CSV files is denoted either as 'implementationstatus' or 'implementation'. When I try to concat using Pandas, this is obviously a problem. I want to change 'implementationstatus' to 'implementation' for each CSV file as it is imported. I could run a loop for each CSV file, change the column name, export it, and then run my code again with everything changed, but that just seems prone to error or unexpected things happening.
Instead, I just want to import all the CSVs, change the column name 'implementationstatus' to 'implementation' IF APPLICABLE, and then concatenate into one data frame. My code is below.
import pandas as pd
import os
import glob
path = 'c:/mydata'
filepaths = [f for f in os.listdir(".") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths), join='inner', ignore_index=True)
df.columns = df.columns.str.replace('implementationstatus', 'implementation')  # I know this doesn't work, but I am trying to demonstrate what I want to do
If you want to change the column name, please try this:
import glob
import pandas as pd
filenames = glob.glob('c:/mydata/*.csv')
all_data = []
for file in filenames:
    df = pd.read_csv(file)
    if 'implementationstatus' in df.columns:
        df = df.rename(columns={'implementationstatus': 'implementation'})
    all_data.append(df)
df_all = pd.concat(all_data, axis=0)
You can use a combination of the header and names parameters of the pd.read_csv function to solve this.
Pass names a list containing the names of all columns in the CSV files. This lets you standardize the names across every file.
From pandas docs:
names: array-like, optional
List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
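A minimal sketch of the header/names approach; the column list and the directory path are assumptions, so replace them with your files' real schema:

```python
import glob
import pandas as pd

# Hypothetical shared schema; the last column absorbs both
# 'implementationstatus' and 'implementation' headers.
COLS = ['id', 'month', 'implementation']

def read_standardized(path):
    # header=0 discards the file's own header row (whatever it says),
    # and names= imposes the uniform column names instead.
    return pd.read_csv(path, header=0, names=COLS)

frames = [read_standardized(f) for f in glob.glob('c:/mydata/*.csv')]
df_all = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=COLS)
```

Because every frame ends up with identical columns, the concat no longer needs join='inner' to paper over the mismatched header.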

I need to capture date from multiple csv filenames and add that date in each file as a new column using Python

I need to capture the date from multiple CSV filenames and add that date in each file as a new column using Python. I have code that works well with Excel files and I am trying to do exactly the same with CSV files. If someone could help me, that would be much appreciated.
Filenames are as following...
Scan_05-22-2021.csv
Scan_05-23-2021.csv
Scan_05-24-2021.csv and so on..
Excel code that works..
import openpyexcel
import os
import pandas as pd
import glob
import csv
from openpyexcel import load_workbook
path_to_xls = os.getcwd()  # or r'<path>'
for xls in os.listdir(r'C:\Python'):
    if xls.endswith(".csv") or xls.endswith(".xlsx"):
        f = load_workbook(filename=xls)
        sheet = f.active
        # Change here the name of the new column
        sheet.cell(row=1, column=25).value = "DateTest"
        for i in range(sheet.max_row - 1):
            # takes the date from the filename and writes it into column 25
            sheet.cell(row=i + 2, column=25).value = xls.split('_')[1][:-5]
        f.save(xls)
        f.close()
You should be able to do this with pandas:
use pd.read_csv to load the files as DataFrames,
use the iterrows method to go over the rows,
and simply append to the new file.
This cheatsheet could be of use.
Good luck!
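The pandas route sketched above could look like the following; the filename pattern comes from the question, while overwriting each file in place and reusing the "DateTest" column name are assumptions:

```python
import glob
import os
import pandas as pd

def add_date_column(path):
    # Filenames look like Scan_05-22-2021.csv; the date sits between
    # the underscore and the .csv extension.
    date = os.path.basename(path).split('_')[1][:-4]
    df = pd.read_csv(path)
    df['DateTest'] = date          # same column name the Excel code used
    df.to_csv(path, index=False)   # overwrite the file in place
    return df

for csv_path in glob.glob('Scan_*.csv'):
    add_date_column(csv_path)
```

Note that pandas broadcasts the scalar `date` down the whole column, so no row-by-row loop (iterrows) is actually needed.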

Create n data frames using a for loop

I would like to know how to give a different name to each of the data frames that I am going to create using the code below.
import os
import pandas as pd
import glob
os.chdir("/Users/path")
dataframes = []
paths = glob.glob("*.csv")
for path in paths:
    dataset = pd.read_csv(path)
    dataframes.append(dataset)
I would like to have something like this:
df1
df2
df3
....
in order to use each of them for different analysis purposes. In the folder I have files like
analysis_for_market.csv, dataset_for_analysis.csv, test.csv, ...
Suppose I have 23 csv files (this count comes from dataframes, as it appends each df).
For each of them I would like to create a dataframe df in python in order to run different analysis.
I would do for one of it:
df=pd.read_csv(path) (where path is "/path/analysis_for_market.csv").
and then I could work on it (adding columns, dropping them, and so on).
However, I would like also to be able to work with another dataset, let say dataset_for_analysis.csv, so I would need to create a new dataframe, df2. This could be useful in case I would like to compare rows.
And so on. Potentially I would need a df for each dataset, so I would need 23 df.
I think it could be done using a for loop, but I have no idea how to refer to each df afterwards (for example, to call df.describe for the two examples above).
Could you please tell me how to do this?
If you find a possible question related to mine, could you please add it in a comment, before closing my question (as a previous post was closed before solving my issues)?
Thank you for your help and understanding.
Update:
import os
import pandas as pd
import glob
os.chdir("/Users/path")
paths = glob.glob("*.csv")
dataframes = []
df = {}
for x in range(1, len(paths)):
    for path in paths:
        df["0".format(x)] = pd.read_csv(path)
        # dataframes[path] = df  # it gives me the following error: TypeError: list indices must be integers or slices, not str
df["2"]
It works only for "0" as written in the code, but I do not know how to make the key range between 1 and len(paths).
Storing each dataframe under its own key will do the job.
import pandas as pd
import glob
import os
os.chdir("/Users/path")
df = {}
paths = glob.glob("*.csv")
for index, path in enumerate(paths):
    df[str(index)] = pd.read_csv(path)
This is working fine for me. If I call df['0'], it gives me the first dataframe.
You can create a global variable with any name you like by doing
globals()["df32"] = ...
But that is usually viewed as poor coding practice (because you might clobber existing names without knowing it).
Instead, just create a dictionary mydfs (say) and do mydfs[1]=...
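A minimal sketch of that dictionary approach; the folder path is a placeholder, and numbering the frames from 1 is just one possible key scheme:

```python
from glob import glob
import os
import pandas as pd

# Hypothetical folder; point this at your own data.
folder = '/Users/path'

mydfs = {}
# sorted() makes the numbering deterministic across runs.
for i, path in enumerate(sorted(glob(os.path.join(folder, '*.csv'))), start=1):
    mydfs[i] = pd.read_csv(path)

# Each frame is then reachable by number, e.g.:
# mydfs[1].describe()
# mydfs[2].head()
```

Keys could just as well be the filenames (mydfs[os.path.basename(path)]), which keeps the link between a frame and its source file explicit.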
from glob import glob
import pandas as pd
for i, path in enumerate(glob('*.csv')):
    # build the statement as a string so the DataFrame is assigned inside exec
    exec("df{0:03d} = pd.read_csv(path, encoding='latin-1')".format(i))
You can adjust the 0:03d bit to change the number of leading zeros, or skip it altogether with df{i}.

Deleting rows from several CSV files using Python

I want to delete specific rows (from row 0 to 33) from every csv file in my directory, but I have 224 separate csv files that need this done. I would be happy if you could show me how to do this with one piece of code.
I think you can use glob and pandas to do this quite easily. I'm not sure if you want to write over your original files (something I never recommend), so be careful, as this code will do that.
import os
import glob
import pandas as pd
os.chdir(r'yourdir')
allFiles = glob.glob("*.csv")  # match your csvs
for file in allFiles:
    df = pd.read_csv(file)
    df = df.iloc[33:, ]  # keep from row 34 onwards
    df.to_csv(file, index=False)  # index=False avoids writing an extra index column
    print(f"{file} has removed rows 0-33")
or something along those lines..
This is a simple combination of two separate tasks.
First, you need to loop through all the csv files in a folder. See this StackOverflow answer for how to do that.
Next, within that loop, for each file, you need to modify the csv by removing rows. See this answer for how to read a csv, write a csv, and omit certain rows based on a condition.
One final aspect is that you want to omit certain line numbers. A good way to do this is with the enumerate function.
So code such as this will give you the line numbers.
import csv
# csv.reader parses each line into a list of fields; iterating the raw
# file object would hand writerow a plain string instead.
input_file = open('first.csv', 'r', newline='')
output_file = open('first_edit.csv', 'w', newline='')
writer = csv.writer(output_file)
for i, row in enumerate(csv.reader(input_file)):
    if i > 33:
        writer.writerow(row)
input_file.close()
output_file.close()
Iterate over CSV files and use Pandas to remove the top 34 rows of each file then save it to an output directory.
Try this code after installing pandas:
from pathlib import Path
import pandas as pd
source_dir = Path('path/to/source/directory')
output_dir = Path('path/to/output/directory')
for file in source_dir.glob('*.csv'):
    df = pd.read_csv(file)
    df.drop(df.head(34).index, inplace=True)  # drop the first 34 rows
    df.to_csv(output_dir.joinpath(file.name), index=False)

Renaming all the excel files as per the list in DataFrame in Python

I have approximately 300 files which are to be renamed as per the excel sheet mentioned below.
The folder looks something like this:
I have tried writing the following code; I think a loop will also be needed. But it is not able to rename even one file. Any clue how this can be corrected?
import os
import pandas as pd
os.path.abspath('C:\\Users\\Home\\Desktop')
master = pd.read_excel('C:\\Users\\Home\\Desktop\\Test_folder\\master.xlsx')
master['old'] = 'C:\\Users\\Home\\Desktop\\Test_folder\\' + master['oldname'] + '.xlsx'
master['new'] = 'C:\\Users\\Home\\Desktop\\Test_folder\\' + master['newname'] + '.xlsx'
newmaster = master[['old', 'new']]
os.rename(newmaster['old'], newmaster['new'])
Load stuff.
import os
import pandas as pd
master = pd.read_excel('C:\\Users\\Home\\Desktop\\Test_folder\\master.xlsx')
Set your current directory to the folder.
os.chdir('C:\\Users\\Home\\Desktop\\Test_folder\\')
Rename things one at a time. While it would be cool, os.rename is not designed to work with pandas.
for row in master.iterrows():
    oldname, newname = row[1]
    os.rename(oldname + '.xlsx', newname + '.xlsx')
Basically, you are passing two pandas Series into os.rename() which expects two strings. Consider passing each Series values elementwise using apply(). And use the os-agnostic, os.path.join to concatenate folder and file names:
import os
import pandas as pd
cd = r'C:\Users\Home\Desktop\Test_folder'
master = pd.read_excel(os.path.join(cd, 'master.xlsx'))
def change_names(row):
    os.rename(os.path.join(cd, row[0] + '.xlsx'), os.path.join(cd, row[1] + '.xlsx'))
master[['oldname', 'newname']].apply(change_names, axis=1)

Categories

Resources