Converting each SAS dataset to a DataFrame in pandas - Python

I am converting each SAS dataset in a directory to an individual DataFrame in pandas.
import os
import pandas as pd
import pyreadstat as pyd
os.chdir(r'XX\XX\XX\XXX')
Assume the current working directory contains the following SAS datasets:
aa.sas7bdat
bb.sas7bdat
cc.sas7bdat
dd.sas7bdat
ee.sas7bdat
Now I am building a dictionary by iterating over the SAS datasets and importing each one into its own DataFrame with pyd.read_sas7bdat.
ddict={}
for file in os.listdir():
    if file.endswith(".sas7bdat"):
        name = os.path.splitext(file)[0]
        ddict[name]=pyd.read_sas7bdat(file,metadataonly=False)
But I still cannot get the above code to work; please help me achieve this. My expected output is a new DataFrame for each SAS dataset, i.e. there should be multiple DataFrames. Note: each DataFrame should be named after its SAS dataset, without the extension.
For example: aa.sas7bdat --> SAS dataset aa --> to be created as DataFrame aa

I would do it like this:
import os
import pyreadstat as pyd
ddict={}
for file in os.listdir():
    if file.endswith(".sas7bdat"):
        name = os.path.splitext(file)[0]
        df, meta = pyd.read_sas7bdat(file)
        # store the dataframe in a dictionary
        ddict[name]= df
        # alternatively bind to a new variable name
        exec(name + "= df.copy()")
Remember that read_sas7bdat gives you a tuple of dataframe and metadata object, not a dataframe only.
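The loop above can also be wrapped in a small helper. This is only a sketch: the function name load_sas_folder and its reader parameter are made up for illustration, and it assumes that every file ending in .sas7bdat in the folder should be loaded. The reader defaults to pyreadstat.read_sas7bdat, which returns a (DataFrame, metadata) tuple.

```python
import os


def load_sas_folder(folder=".", reader=None):
    """Return {dataset_name: DataFrame} for every .sas7bdat file in folder.

    `reader` is a hypothetical hook: any callable that takes a path and
    returns a (dataframe, metadata) tuple, like pyreadstat.read_sas7bdat.
    """
    if reader is None:
        # Imported lazily so the helper can be tried with a stand-in reader.
        import pyreadstat
        reader = pyreadstat.read_sas7bdat

    frames = {}
    for file in sorted(os.listdir(folder)):
        if file.endswith(".sas7bdat"):
            name = os.path.splitext(file)[0]           # "aa.sas7bdat" -> "aa"
            df, _meta = reader(os.path.join(folder, file))
            frames[name] = df                          # keep only the DataFrame
    return frames
```

With the files from the question, frames = load_sas_folder() would give a dictionary with the keys aa, bb, cc, dd and ee, so frames["aa"] plays the role of the DataFrame aa without creating variable names dynamically.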

Related

How to modify the reading/parsing of an Excel Column in a data Frame?

I am trying to process an Excel file with pandas. The filter to be applied is by the values of the "Test Code" column, which has the format "XX10.X/XX12.X" (e.g. EF10.1). The problem is that the DataFrame drops everything after the dot when reading the column, leaving just "XX10". The information after the dot is the most important part.
The original document classifies those cells as dates, which is probably altering the normal processing of the values.
[Image: excelfile]
The code I am using is:
import os
import pandas as pd
file = "H2020_TRI-HP_T6.2_PropaneIceFaultTests_v1"
folder = r"J:\Downloads"
file_path = os.path.join(folder,file+".xlsx")
df = pd.read_excel(file_path,sheet_name="NF10")
df["Test Code"]
The output is:
[Image: output]

Pandas - import CSV files in folder, change column name if it contains a string, concat into one dataframe

I have a folder with about 40 CSV files containing data by month. I want to combine them all, but one column in these CSV files is named either 'implementationstatus' or 'implementation'. When I try to concat using pandas, this is obviously a problem. I basically want to change 'implementationstatus' to 'implementation' for each CSV file as it is imported. I could loop over the CSV files, change the column name, export each one, and then run my code again with everything changed, but that seems prone to errors or unexpected results.
Instead, I just want to import all the CSVs, change the column name 'implementationstatus' to 'implementation' IF APPLICABLE, and then concatenate into one data frame. My code is below.
import pandas as pd
import os
import glob
path = 'c:/mydata'
filepaths = [f for f in os.listdir(".") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths),join='inner', ignore_index=True)
df.columns = df.columns.str.replace('implementationstatus', 'implementation') # I know this doesn't work, but I am trying to demonstrate what I want to do
If you want to change the column name, please try this:
import glob
import pandas as pd
filenames = glob.glob('c:/mydata/*.csv')
all_data = []
for file in filenames:
    df = pd.read_csv(file)
    if 'implementationstatus' in df.columns:
        df = df.rename(columns={'implementationstatus':'implementation'})
    all_data.append(df)
df_all = pd.concat(all_data, axis=0)
You can use a combination of the header and names parameters of the pd.read_csv function to solve it.
You must pass to names a list containing the names of all columns in the CSV files. This will allow you to standardize all names.
From pandas docs:
names: array-like, optional
List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
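A minimal sketch of the header/names approach follows. The two in-memory "files" and the column list are made up for illustration, and the sketch assumes that every CSV has the same columns in the same order, because names are assigned by position rather than matched against the old header.

```python
import io

import pandas as pd

# Two in-memory stand-ins for CSV files whose header differs (made-up data).
csv_a = io.StringIO("month,implementationstatus\n2020-01,done\n")
csv_b = io.StringIO("month,implementation\n2020-02,pending\n")

# Assumed layout: the same columns, in the same order, in every file.
column_names = ["month", "implementation"]

# header=0 tells pandas that row 0 is a header to be replaced by `names`.
frames = [pd.read_csv(f, header=0, names=column_names) for f in (csv_a, csv_b)]
df_all = pd.concat(frames, ignore_index=True)
```

If the column order can differ between files, renaming by label with df.rename is the safer choice.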

Why can't I extract a single column using pandas?

I have a (theoretically) simple task. I need to pull out a single column of 4000ish names from a table and use it in another table.
I'm trying to extract the column using pandas and I have no idea what is going wrong. It keeps flagging an error:
TypeError: string indices must be integers
import pandas as pd
file ="table.xlsx"
data = file['Locus tag']
print(data)
You have only assigned the file name to a variable, so file is just a string, and indexing a string with 'Locus tag' raises the TypeError. You never loaded the file: first read it with the read_excel function from pandas, which returns a DataFrame. From that DataFrame you can extract the column.
Sample Code
import pandas as pd
import os
p = os.path.dirname(os.path.realpath(r"C:\Car_sales.xlsx"))  # folder that contains the workbook
name = 'Car_sales.xlsx'
path = os.path.join(p, name)
Z = pd.read_excel(path)
Z.head()
Sample Code
import pandas as pd
df = pd.read_excel("add the path")
df.head()
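To make the difference concrete, here is a small sketch. The DataFrame is a made-up stand-in for what pd.read_excel("table.xlsx") would return, so the column values are purely illustrative.

```python
import pandas as pd

file = "table.xlsx"            # just a str, not the table itself
try:
    file["Locus tag"]          # indexing a str with a str
except TypeError as err:
    error_message = str(err)   # the "string indices must be integers" error

# Stand-in for df = pd.read_excel(file); the values are invented.
df = pd.DataFrame({"Locus tag": ["b0001", "b0002"], "Length": [66, 2463]})

locus_tags = df["Locus tag"]   # selecting one column gives a pandas Series
names = locus_tags.tolist()    # plain Python list, ready to use elsewhere
```

Once the real file has been loaded with pd.read_excel, the same df["Locus tag"] selection should apply, provided the column header matches exactly.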

Renaming all the excel files as per the list in DataFrame in Python

I have approximately 300 files which are to be renamed as per the excel sheet mentioned below
The folder looks something like this :
I have tried writing the following code, and I think a loop will be needed as well. But it is not able to rename even one file. Any clue how this can be corrected?
import os
import pandas as pd
os.path.abspath('C:\\Users\\Home\\Desktop')
master=pd.read_excel('C:\\Users\\Home\\Desktop\\Test_folder\\master.xlsx')
master['old']=('C:\\Users\\Home\\Desktop\\Test_folder\\'+master['oldname']+'.xlsx')
master['new']=('C:\\Users\\Home\\Desktop\\Test_folder\\'+master['newname']+'.xlsx')
newmaster=master[['old','new']]
os.rename(newmaster['old'],newmaster['new'])
Load stuff.
import os
import pandas as pd
master = pd.read_excel('C:\\Users\\Home\\Desktop\\Test_folder\\master.xlsx')
Set your current directory to the folder.
os.chdir('C:\\Users\\Home\\Desktop\\Test_folder\\')
Rename things one at a time. While it would be cool, os.rename is not designed to work with pandas.
for _, row in master.iterrows():
    # look the names up by column label so extra columns in the sheet do no harm
    os.rename(row['oldname']+'.xlsx', row['newname']+'.xlsx')
Basically, you are passing two pandas Series into os.rename() which expects two strings. Consider passing each Series values elementwise using apply(). And use the os-agnostic, os.path.join to concatenate folder and file names:
import os
import pandas as pd
cd = r'C:\Users\Home\Desktop\Test_folder'
master = pd.read_excel(os.path.join(cd, 'master.xlsx'))
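def change_names(row):
    # access the values by column label; positional row[0] / row[1] is deprecated for labelled Series
    os.rename(os.path.join(cd, row['oldname'] +'.xlsx'), os.path.join(cd, row['newname'] +'.xlsx'))
master[['oldname', 'newname']].apply(change_names, axis=1)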
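Another way to hand os.rename one pair of strings at a time is to iterate over (old, new) pairs directly. The helper below is a sketch: the name rename_files, the ext parameter and the returned list of missing files are illustrative choices, not part of the answers above.

```python
import os


def rename_files(folder, pairs, ext=".xlsx"):
    """Rename old+ext to new+ext inside folder for each (old, new) pair.

    Returns the old names whose file was not found, so that a typo in the
    sheet does not go unnoticed.
    """
    missing = []
    for old, new in pairs:
        src = os.path.join(folder, f"{old}{ext}")
        dst = os.path.join(folder, f"{new}{ext}")
        if os.path.exists(src):
            os.rename(src, dst)    # os.rename receives two plain strings
        else:
            missing.append(old)
    return missing
```

With the master sheet from the question, this could be called as rename_files(cd, zip(master['oldname'], master['newname'])), assuming those two columns hold the file names without the extension.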

Python: Import Excel Data and lookup values in Dictionary

Total beginner to Python: I am trying to import the values from an Excel column, look them up in a Python dictionary (which I was able to create), then write the results into the Excel file and check whether they match another column in the file.
You can use a module called pandas.
pip install pandas
To read the file use the following:
import pandas as pd
file = pd.ExcelFile('path/to/excelsheet/').parse('sheet_you_want_to_use') # 'Sheet 1' for Sheet 1
You can now access the columns using the column names as keys: file['column_name'].
You can then append the looked-up values to a list and write them to an Excel file as follows:
values = ['....values....']  # avoid naming this "list", which would shadow the built-in
pd.DataFrame(values).to_excel('where/to/save/file')
I would advise you to read the following documentation:
pandas DataFrame
pandas ExcelFile
pandas to_excel
pandas
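The lookup and comparison steps can be sketched with Series.map. The column names code and expected, the dictionary contents and the DataFrame itself are all made up here; in practice the DataFrame would come from reading the Excel sheet.

```python
import pandas as pd

# Hypothetical data standing in for the sheet read from Excel.
df = pd.DataFrame({
    "code": ["a", "b", "c"],
    "expected": ["apple", "banana", "cherry"],
})
lookup = {"a": "apple", "b": "berry"}          # hypothetical dictionary

# Look every value up in the dictionary; keys that are missing become NaN.
df["looked_up"] = df["code"].map(lookup)

# Row-wise comparison with the other column; NaN never equals a string.
df["matches"] = df["looked_up"] == df["expected"]
```

The result could then be saved with df.to_excel('where/to/save/file', index=False), which needs an Excel writer engine such as openpyxl to be installed.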
