Create n data frames using a for loop - python

I would like to know how to give a different name to each of the data frames I am going to create with the code below.
import os
import pandas as pd
import glob

os.chdir("/Users/path")
dataframes = []
paths = glob.glob("*.csv")
for path in paths:
    dataset = pd.read_csv(path)
    dataframes.append(dataset)
I would like to have something like this:
df1
df2
df3
....
in order to use each of them for different analysis purposes. In the folder I have files like
analysis_for_market.csv, dataset_for_analysis.csv, test.csv, ...
Suppose I have 23 CSV files (that is the length of dataframes once the loop has appended each of them).
For each of them I would like to create a dataframe in Python in order to run different analyses.
For one of them I would do:
df=pd.read_csv(path) (where path is "/path/analysis_for_market.csv").
and then I could work on it (adding columns, dropping them, and so on).
However, I would also like to be able to work with another dataset, say dataset_for_analysis.csv, so I would need to create a new dataframe, df2. This could be useful in case I would like to compare rows.
And so on. Potentially I would need a df for each dataset, so I would need 23 df.
I think it could be done using a for loop, but I have no idea how to refer to each df afterwards (for example, how to run df.describe() on the two examples above).
Could you please tell me how to do this?
If you find an existing question related to mine, could you please add it in a comment before closing my question (a previous post of mine was closed before my issue was solved)?
Thank you for your help and understanding.
Update:
import os
import pandas as pd
import glob

os.chdir("/Users/path")
paths = glob.glob("*.csv")
dataframes = []
df = {}
for x in range(1, len(paths)):
    for path in paths:
        df["0".format(x)] = pd.read_csv(path)
        # dataframes[path] = df  # gives: TypeError: list indices must be integers or slices, not str
df["2"]
It works, but only for the key "0": the format string "0" has no placeholder, so "0".format(x) always produces "0". I do not know how to make the key range from 1 to len(paths).

Giving each dataframe a name as a dictionary key will do the job.
import pandas as pd
import glob
import os

os.chdir("/Users/path")
df = {}
paths = glob.glob("*.csv")
for index, path in enumerate(paths):
    df[str(index)] = pd.read_csv(path)
This works fine for me: calling df['0'] gives me the first dataframe.
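A variant of the same idea keys the dictionary by file name instead of index, which makes it easier to tell the 23 dataframes apart (a sketch; dfs and the derived keys are illustrative):
import os
import glob
import pandas as pd

os.chdir("/Users/path")
dfs = {}
for path in glob.glob("*.csv"):
    # Use the file name without its extension as the key,
    # e.g. dfs["dataset_for_analysis"] for dataset_for_analysis.csv
    key = os.path.splitext(os.path.basename(path))[0]
    dfs[key] = pd.read_csv(path)

dfs["dataset_for_analysis"].describe()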

You can create a global variable with any name you like by doing
globals()["df32"] = ...
But that is usually viewed as poor coding practice (because you might be clobbering existing names without knowing it).
Instead, just create a dictionary mydfs (say) and do mydfs[1]=...
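A minimal sketch of that dictionary approach (mydfs is the name suggested above; the keys here are plain integers):
import glob
import pandas as pd

mydfs = {}
for i, path in enumerate(glob.glob("*.csv"), start=1):
    mydfs[i] = pd.read_csv(path)  # mydfs[1] ... mydfs[23] for 23 files

mydfs[2].describe()  # work with the second dataframe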

from glob import glob
import pandas as pd
for i, path in enumerate(glob('*.csv')):
    # Build and run a statement like: df000 = pd.read_csv('some.csv', encoding='latin-1')
    exec("df{0:03d} = pd.read_csv({1!r}, encoding='latin-1')".format(i, path))
You can adjust the 0:03d bit to the number of leading zeros you'd like, or skip the padding altogether with df{0}.
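For reference, with three leading zeros the statement the loop builds for the first file is equivalent to writing out (the file name here is illustrative):
import pandas as pd

# what exec runs for i == 0:
df000 = pd.read_csv('analysis_for_market.csv', encoding='latin-1')
Note that exec, like globals() above, silently overwrites any existing variable with the generated name.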

Related

How to make pandas ignore hidden sheets or how to read_excel for one of two specified sheet_names

I have a loop for importing data from a large number of Excel files. However, some of them have a hidden first tab. I would pull a specific tab by name, but two different naming conventions are used.
This is my code
import pandas as pd
import string
import glob
import os

directory = 'file path'
files = os.listdir(directory)
list_of_dfs = []
os.chdir("C:/Users/nlarmann/Desktop/Q3_JVs")
for file in files:
    df = pd.read_excel(file)
    df = df.T
    df = df.iloc[[1], :24]
    list_of_dfs.append(df)
data_combined = pd.concat(list_of_dfs)
data_combined.to_excel('filepath/output.xlsx', index=False)
I know I could specify a sheet name to target the tab I want, but I am unsure how to make Python try two different names without requiring either to be found. I am looking for either a way to make Python check both naming conventions, or to just ignore all hidden sheets (the one I want is the first visible sheet).
I realize there is a way to identify if a sheet is hidden or not, but I am uncertain on how to integrate that into my existing code.
I appreciate any assistance.
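A sketch of that hidden-sheet check using openpyxl (assuming the workbooks are .xlsx; read_first_visible_sheet is an illustrative helper, not part of the code above):
import pandas as pd
from openpyxl import load_workbook

def read_first_visible_sheet(path):
    # openpyxl reports each sheet's state: 'visible', 'hidden', or 'veryHidden'
    wb = load_workbook(path, read_only=True)
    for name in wb.sheetnames:
        if wb[name].sheet_state == "visible":
            return pd.read_excel(path, sheet_name=name)
    raise ValueError("no visible sheet in %s" % path)
This sidesteps the naming-convention problem entirely, since the sheet is picked by visibility rather than by name.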

Pandas - import CSV files in folder, change column name if it contains a string, concat into one dataframe

I have a folder with about 40 CSV files containing data by month. I want to combine them all, but one column in these files is denoted as either 'implementationstatus' or 'implementation'. When I try to concat using pandas, this is obviously a problem. I want to change 'implementationstatus' to 'implementation' in each CSV file as it is imported. I could run a loop over each file, change the column name, export it, and then run my code again, but that seems prone to error or unexpected behavior.
Instead, I just want to import all the CSVs, change the column name 'implementationstatus' to 'implementation' IF APPLICABLE, and then concatenate into one data frame. My code is below.
import pandas as pd
import os
import glob

path = 'c:/mydata'
filepaths = [f for f in os.listdir(".") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths), join='inner', ignore_index=True)
df.columns = df.columns.str.replace('implementationstatus', 'implementation')  # I know this doesn't work, but it demonstrates what I want to do
If you want to change the column name, please try this:
import glob
import pandas as pd

filenames = glob.glob('c:/mydata/*.csv')
all_data = []
for file in filenames:
    df = pd.read_csv(file)
    if 'implementationstatus' in df.columns:
        df = df.rename(columns={'implementationstatus': 'implementation'})
    all_data.append(df)
df_all = pd.concat(all_data, axis=0)
You can use a combination of the header and names parameters of pd.read_csv to solve it.
Pass names a list containing the names of all columns in the CSV files. This lets you standardize the names across files.
From pandas docs:
names: array-like, optional
List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
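A sketch of that approach (the column list is illustrative and must match the real files' columns, in order):
import glob
import pandas as pd

cols = ['project', 'implementation', 'month']  # illustrative; list every actual column
# header=0 tells pandas the files do have a header row, which is replaced by cols
frames = [pd.read_csv(f, header=0, names=cols) for f in glob.glob('c:/mydata/*.csv')]
df = pd.concat(frames, join='inner', ignore_index=True)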

How do I replace cell values in pandas dataframe with the filename?

I'm appending several CSV files into one data frame, to then export into a combined CSV file. However, I need to replace the value "DTL" in the first column of each file with the filename so that the resulting data can still be tied back to each file.
Here's an example of my code:
import pandas as pd
import glob
import os

all_files = glob.glob("*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=None, sep='\t')
    dfproper = df.drop(0, axis=0)
    dfproper.replace(to_replace="DTL", value=os.path.basename(filename), inplace=True)
    li.append(dfproper)
df_final = pd.concat(li, axis=0, ignore_index=True)
df_final.to_csv("Combined.csv", index=None, header=None, sep='\t')
The code doesn't give me any errors in this form but also doesn't replace the "DTL" values.
Figured out the solution to this.
The separator I was using was '\t' (tab), but since these are CSVs they are literally comma-separated. So I needed to omit the sep parameter, which then allowed the "DTL" values to match the to_replace parameter.
I also cleaned the code up a bit by skipping the header row in the pd.read_csv step instead of dropping it afterwards.
Here's an example of the finished product.
import pandas as pd
import glob
import os

all_files = glob.glob("*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=None, skiprows=1, low_memory=False)
    df.replace("DTL", os.path.basename(filename), inplace=True)
    li.append(df)
df_final = pd.concat(li, axis=0, ignore_index=True)
df_final.to_csv("DHLE_20211114.csv", index=None, header=None)
df_final.head()
Thanks to everyone who commented!

Renaming all the excel files as per the list in DataFrame in Python

I have approximately 300 files which are to be renamed according to a mapping in an Excel sheet (master.xlsx, with oldname and newname columns; the screenshots of the sheet and the folder are not reproduced here).
I have tried the following code; I think a loop will be needed as well, but it cannot rename even one file. Any clue how this can be corrected?
import os
import pandas as pd

os.path.abspath('C:\\Users\\Home\\Desktop')
master = pd.read_excel('C:\\Users\\Home\\Desktop\\Test_folder\\master.xlsx')
master['old'] = 'C:\\Users\\Home\\Desktop\\Test_folder\\' + master['oldname'] + '.xlsx'
master['new'] = 'C:\\Users\\Home\\Desktop\\Test_folder\\' + master['newname'] + '.xlsx'
newmaster = master[['old', 'new']]
os.rename(newmaster['old'], newmaster['new'])
Load stuff.
import os
import pandas as pd
master = pd.read_excel('C:\\Users\\Home\\Desktop\\Test_folder\\master.xlsx')
Set your current directory to the folder.
os.chdir('C:\\Users\\Home\\Desktop\\Test_folder\\')
Rename things one at a time. While it would be cool, os.rename is not designed to work with pandas.
for row in master.iterrows():
    oldname, newname = row[1]
    os.rename(oldname + '.xlsx', newname + '.xlsx')
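A variant that avoids unpacking the row by position (a sketch, assuming the columns are literally named oldname and newname as in the question):
for row in master.itertuples(index=False):
    # itertuples yields namedtuples, so columns are available as attributes
    os.rename(row.oldname + '.xlsx', row.newname + '.xlsx')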
Basically, you are passing two pandas Series into os.rename(), which expects two strings. Consider handling each row elementwise using apply(), and use the os-agnostic os.path.join to concatenate folder and file names:
import os
import pandas as pd
cd = r'C:\Users\Home\Desktop\Test_folder'
master = pd.read_excel(os.path.join(cd, 'master.xlsx'))
def change_names(row):
    # row[0] is oldname, row[1] is newname
    os.rename(os.path.join(cd, row[0] + '.xlsx'), os.path.join(cd, row[1] + '.xlsx'))
master[['oldname', 'newname']].apply(change_names, axis=1)

Python read in multiple .txt files and row bind using pandas

I'm coming from R (and SAS) and am having an issue reading a large set of .txt files (all stored in the same directory) and creating one large dataframe in pandas. So far I have attempted an amalgamation of code, all of which fails miserably. I assume this is a simple task but I lack the experience in Python...
If it helps, this is the data I would like to combine into one large dataframe: http://www.ssa.gov/oact/babynames/limits.html
(the state-specific sets, 50 in total, each named with its state abbreviation plus .txt)
Please help!
import pandas as pd
import glob
filelist = glob.glob(r"C:\Users\Dell\Downloads\Names\*.txt")
names = ['state', 'gender', 'year', 'name', 'count']
Then, I was thinking of using pd.concat, but am not sure - essentially I want to read in each dataset and then row-bind the sets together (given they all have the same columns)...
concat is nice since "join" is set to "outer" (i.e. union of index) by default. You could just as easily use df.join(), but must specify "how" as "outer". Either way, you can build a dataframe quite simply:
import pandas as pd
from glob import glob as gg
data = pd.DataFrame()
names = ['state', 'gender', 'year', 'name', 'count']
for f in gg('*.txt'):
    # read_csv has no "columns" parameter; pass the column names via "names"
    # (the state files have no header row, so every line is read as data)
    tmp = pd.read_csv(f, names=names)
    data = pd.concat([data, tmp], axis=0, ignore_index=True)
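A common variant collects the frames in a list and concatenates once at the end, which avoids re-copying the growing dataframe on every iteration (a sketch under the same assumptions about the files):
import pandas as pd
from glob import glob

names = ['state', 'gender', 'year', 'name', 'count']
frames = [pd.read_csv(f, names=names) for f in glob('*.txt')]
data = pd.concat(frames, axis=0, ignore_index=True)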
