I need to create a dataframe for each of the following datasets (csv files stored in a folder):
0 text1.csv
1 text2.csv
2 text3.csv
3 text4.csv
4 text5.csv
The above list is produced after changing into the folder with os.chdir; it shows all the csv files in that path:
os.chdir("path")
To create the dataframe (to be used later on) for each of the datasets above, I am doing as follows:
texts = []
for item in glob.glob("*.csv"):
    texts.append(item)

for (x, z) in enumerate(texts):
    print(x, z)
    df = pd.read_csv(datasets[int(x)])
    df.index.name = datasets[int(x)]
However, it does not create any dataframe. I think the problem is with df, as I am not distinguishing it for each dataset (I am only reading each dataset using pd.read_csv(datasets[int(x)])).
Could you please tell me how to create a dataframe for each of the datasets (for example df1 related to text1, df2 related to text2, and so on)?
Thank you for your help.
I'd use a function and return a list of the dataframes
Simple, one-liner function:
import glob
import pandas as pd
def get_all_csv(path, sep=','):
    # read all the csv files in a directory to a list of dataframes
    return [pd.read_csv(csv_file, sep=sep)
            for csv_file in glob.glob(path + "*.csv")]
# get all the csv in the current directory
dfs = get_all_csv('./', sep=';')
print(dfs)
Is a list of dataframes what you are looking for?
import pandas as pd
import glob
results = []
paths = glob.glob("*.csv")
for path in paths:
    df = pd.read_csv(path)
    results.append(df)
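If you also want each dataframe tied to its source file (df1 for text1 and so on, as asked), a dict keyed by filename is the usual substitute for numbered variable names. A minimal sketch along the same lines (the key format is my choice, not part of the answer above):
import glob
import pandas as pd

# map each csv filename (minus the extension) to its dataframe
results = {path.rsplit(".", 1)[0]: pd.read_csv(path)
           for path in glob.glob("*.csv")}

results["text1"]  # the dataframe loaded from text1.csv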
Related
I have a folder with around 1000 .csv files stored inside. I have to create a data frame based on 50 of these files, so instead of loading them one by one, is there any faster approach available?
I would also like each file_name to become the name of its data frame.
I tried the method below, but it is not working.
# List of files that I want to load out of the 1000
path = "..."
file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']
for i in range(0, len(file_names)):
    col_names[i] = pd.read_csv(path + col_name[i])
But when I tried to read the variable names, nothing was displayed.
Is there any way I can achieve the desired result?
I have checked various articles, but in each of them all the data ends up concatenated at the end, and I want each file to be loaded individually.
IIUC, what you want to do is create multiple dfs, not concatenate them.
In that case you can read each file with read_csv and stack the returned df objects in a list:
your_paths = [...]  # paths to all of your wanted csvs
l = [pd.read_csv(i) for i in your_paths]  # this gives you a list of your dfs
l[0]  # one of your dfs
If you want them named, you can build a dict with named keys instead.
You can then access them individually, through index slicing or key slicing depending on the data structure you use.
I would not recommend this, though, as it is counter-intuitive, and multiple df objects use a little more memory than a single one.
file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']
data_frames = {}
for file_name in file_names:
    df = pd.read_csv(file_name)
    data_frames[file_name.split('.')[0]] = df
Now you can reach any data frame from the data_frames dictionary; for example, data_frames['a'] gives you the frame loaded from a.csv.
Try:
import glob
import pandas as pd

p = glob.glob('folder_path_where_csv_files_stored/*.csv')  # 1. returns a list of all csv files in this folder, no need to type them one by one
d = [pd.read_csv(i) for i in p]  # 2. creates a list of dataframes: one dataframe from each csv file
df = pd.concat(d, axis=0, ignore_index=True)  # 3. builds one dataframe `df` from the dataframes in the list `d`
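If you also need to know which file each row came from, concat's keys argument can label the pieces (a variation on the answer above, not part of it):
df = pd.concat(d, keys=p)  # the outer index level now holds each row's source path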
I can read one ann file into pandas dataframe as follows:
df = pd.read_csv('something/something.ann', sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
df.head()
But I don't know how to read multiple ann files into one pandas dataframe. I tried to use concat, but the result is not what I expected.
How can I read many ann files into one pandas dataframe?
It sounds like you need to use glob to pull in all the .ann files from a folder and add them to a list of dataframes. After that you probably want to join/merge/concat them as required.
I don't know your exact requirements, but the code below should get you close. As it stands, the script assumes that, relative to where you are running it, there is a subfolder called files, and it pulls in all the .ann files from there (it will not look at anything else). Review and change as required; it is commented per line.
import pandas as pd
import glob

path = r'./files'  # use your path
all_files = glob.glob(path + "/*.ann")

# create empty list to hold dataframes from files found
dfs = []

# for each file in the path above ending .ann
for file in all_files:
    # open the file
    df = pd.read_csv(file, sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
    # add this new (temp during the looping) frame to the end of the list
    dfs.append(df)

# at this point you have a list of frames with each list item as one .ann file. Like [annFile1, annFile2, etc.] - just not those names.

# handle a list that is empty
if len(dfs) == 0:
    print('No files found.')
    # create a dummy frame
    df = pd.DataFrame()
# or have only one item/frame and get it out
elif len(dfs) == 1:
    df = dfs[0]
# or concatenate more than one frame together
else:  # modify this join as required
    df = pd.concat(dfs, ignore_index=True)
    df = df.reset_index(drop=True)

# check what you've got
print(df.head())
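If you later need to know which .ann file a row came from, one common tweak (my addition, not part of the answer above) is to tag each frame before appending it:
# same loop over all_files as above, with a provenance column added
for file in all_files:
    df = pd.read_csv(file, sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
    df['source_file'] = file  # record which .ann file each row came from
    dfs.append(df)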
I have to merge different csv files, which contain features about a place, into one file based on place_id, so that I can create a model to predict a rating for a particular place.
I have already tried pandas.concat and merging the files through the Linux terminal, but I just get null values for all the other features, as the place_id keeps repeating.
#importing libraries
import pandas as pd
import numpy as np
import glob
#creating a single dataframe
fileList = glob.glob('chef*.csv')
fileList.append('rating_final.csv')
dfList = []
for file in fileList:
    print(file)
    df = pd.read_csv(file)
    dfList.append(df)
concatDf = pd.concat(dfList, axis=0)
I expect to get a csv file with the different features lined up against a single place_id, but what I get is a csv file in which place_id keeps repeating with a single feature only.
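Side note: pd.concat with axis=0 stacks rows, which matches the repeating place_id described above; lining features up side by side per place_id is a join instead. A minimal sketch of that approach, assuming every file really does share a place_id column (how='outer' keeps places missing from some files; adjust as needed):
from functools import reduce
import glob
import pandas as pd

frames = [pd.read_csv(f) for f in glob.glob('chef*.csv')]
frames.append(pd.read_csv('rating_final.csv'))

# successively join every frame on place_id, keeping one row per place
merged = reduce(lambda left, right: pd.merge(left, right, on='place_id', how='outer'), frames)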
Try this:
import glob
import pandas as pd

df2 = pd.read_csv('rating_final.csv')
# pandas does not expand wildcards, so collect the chef files with glob first
chef_dfs = [pd.read_csv(f) for f in glob.glob('chef*.csv')]
test_df = pd.concat(chef_dfs + [df2], ignore_index=True, sort=True)
print(test_df)
The combined output will be available in test_df.
I would like to create scalable code to import multiple CSV files, standardize the order of the columns based on the column names, and re-write the CSV files.
import glob
import pandas as pd
# Get a list of all the csv files
csv_files = glob.glob('*.csv')
# List comprehension that loads all of the files
dfs = [pd.read_csv(x, delimiter=";") for x in csv_files]

A = pd.DataFrame(dfs[0])
B = pd.DataFrame(dfs[1])
alpha = A.columns.values.tolist()
print([pd.DataFrame(x[alpha]) for x in dfs])
I would like to be able to split this object, write a CSV for each of the files, and rename them with the original names. Is that easily possible with Python? Thanks for your help.
If you want to reorder columns into a consistent order, assuming that all csvs have the same column names but in a different order, you can sort one of the column-name lists and then order the other ones by it. Using your example:
csv_files = glob.glob('*.csv')
sorted_columns = []
for e, x in enumerate(csv_files):
    df = pd.read_csv(x, delimiter=";")
    if e == 0:
        sorted_columns = sorted(df.columns.values.tolist())
    # index=False stops pandas from adding an extra index column on re-write
    df[sorted_columns].to_csv(x, sep=";", index=False)
I have around 20+ xlsx files, and each xlsx file might contain a different number of worksheets. But thank god, all the columns are the same across all worksheets and all xlsx files. By referring to here, I got some idea. I have been trying a few ways to import and append all the excel files (all worksheets) into a single dataframe (around 4 million rows of records).
Note: I did check here as well, but it only covers the file level; mine goes from the file level down to the worksheet level.
I have tried the below code:
# import all necessary package
import pandas as pd
from pathlib import Path
import glob
import sys
# set source path
source_dataset_path = "C:/Users/aaa/Desktop/Sample_dataset/"
source_dataset_list = glob.iglob(source_dataset_path + "Sales transaction *")
out_df = pd.DataFrame()  ## create empty output dataframe (before the loop, so it accumulates every file)
for file in source_dataset_list:
    #xls = pd.ExcelFile(source_dataset_list[i])
    sys.stdout.write(str(file))
    sys.stdout.flush()
    xls = pd.ExcelFile(file)
    for sheet in xls.sheet_names:
        sys.stdout.write(str(sheet))  ## view the excel file's sheet names
        sys.stdout.flush()
        #df = pd.read_excel(source_dataset_list[i], sheet_name=sheet)
        df = pd.read_excel(file, sheet_name=sheet)
        out_df = out_df.append(df)  ## this appends the rows of one dataframe to another (just like your expected output)
Question:
My approach is to first read every single excel file and get the list of sheets inside it, then load and append all the sheets. The looping seems not very efficient, especially as the data size increases with every append.
Is there any other efficient way to import and append all sheets from multiple excel files?
Use sheet_name=None in read_excel to return an OrderedDict of DataFrames created from all the sheet names, then join them together with concat, and finally append to the output DataFrame with DataFrame.append:
out_df = pd.DataFrame()
for f in source_dataset_list:
    df = pd.read_excel(f, sheet_name=None)
    cdf = pd.concat(df.values())
    out_df = out_df.append(cdf, ignore_index=True)
Another solution:
cdf = [pd.read_excel(excel_names, sheet_name=None).values()
for excel_names in source_dataset_list]
out_df = pd.concat([pd.concat(x) for x in cdf], ignore_index=True)
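The second solution should also scale better: it gathers everything first and calls concat once, instead of re-copying the accumulated frame on every append inside the loop, which is what makes the original approach slow as the data grows.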
If I understand your problem correctly, setting sheet_name=None in pd.read_excel does the trick.
import os
import pandas as pd
path = "C:/Users/aaa/Desktop/Sample_dataset/"
dfs = [
pd.concat(pd.read_excel(path + x, sheet_name=None))
for x in os.listdir(path)
if x.endswith(".xlsx") or x.endswith(".xls")
]
df = pd.concat(dfs)
I have a pretty straightforward solution if you want to read all the sheets:
import pandas as pd
df = pd.concat(pd.read_excel(path+file_name, sheet_name=None),
ignore_index=True)
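Note that this line combines the sheets of a single workbook; to cover a whole folder you would wrap it over the file list, much like the os.listdir answer above. A sketch, assuming the same path variable and an *.xlsx pattern:
import glob
import pandas as pd

# combine every sheet of every workbook under `path` into one frame
df = pd.concat(
    (pd.concat(pd.read_excel(f, sheet_name=None), ignore_index=True)
     for f in glob.glob(path + "*.xlsx")),
    ignore_index=True,
)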