I would like to write scalable code that imports multiple CSV files, standardizes the order of the columns based on the column names, and re-writes the CSV files.
import glob
import pandas as pd
# Get a list of all the csv files
csv_files = glob.glob('*.csv')
# List comprehension that loads all of the files
dfs = [pd.read_csv(x,delimiter=";") for x in csv_files]
A=pd.DataFrame(dfs[0])
B=pd.DataFrame(dfs[1])
alpha=A.columns.values.tolist()
print([pd.DataFrame(x[alpha]) for x in dfs])
I would like to be able to split this object, write a CSV for each of the files, and name them with the original file names. Is that easily possible with Python? Thanks for your help.
If you want to reorder columns into a consistent order, assuming that all the CSVs have the same column names but in a different order, you can sort one of the column-name lists and then order the other ones by that list. Using your example:
csv_files = glob.glob('*.csv')
sorted_columns = []
for e, x in enumerate(csv_files):
    df = pd.read_csv(x, delimiter=";")
    if e == 0:
        sorted_columns = sorted(df.columns.values.tolist())
    df[sorted_columns].to_csv(x, sep=";")
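If you would rather keep the first file's column order (the alpha list from the question) instead of sorting alphabetically, a minimal variation, assuming every file really does share the same set of columns, could look like this; index=False avoids writing the pandas index back into the files:

import glob
import pandas as pd

csv_files = glob.glob('*.csv')

column_order = None
for path in csv_files:
    df = pd.read_csv(path, delimiter=";")
    if column_order is None:
        # take the first file's columns as the reference order
        column_order = df.columns.tolist()
    # reorder and overwrite the file under its original name
    df[column_order].to_csv(path, sep=";", index=False)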
Related
I have a folder, and suppose there are 1000 .csv files stored inside it. Now I have to create a data frame from each of 50 of these files, so instead of loading them one by one, is there a fast approach available?
I also want the file name to be the name of my data frame.
I tried the method below, but it is not working.
# List of the files I want to load, out of the 1000
path = "..."
file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']
for i in range(0, len(file_names)):
    col_names[i] = pd.read_csv(path + col_name[i])
But when I try to read the variables by name, nothing is displayed.
Is there any way I can achieve the desired result?
I have checked various articles, but in each of them all the data ends up concatenated at the end, and I want each file to be loaded individually.
IIUC, what you want to do is create multiple dfs and not concatenate them.
In that case you can read them with read_csv and stack the returned df objects in a list:
your_paths = []  # paths to all of the csv files you want
l = [pd.read_csv(i) for i in your_paths]  # this gives you a list of your dfs
l[0]  # one of your dfs
If you want them named, you can store them in a dict with named keys instead.
You can access them individually, through index slicing or key lookup, depending on the data structure you use.
I would not particularly recommend this, though, as it is counter-intuitive and multiple df objects use a little more memory than a single one.
file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']
data_frames = {}
for file_name in file_names:
    df = pd.read_csv(file_name)
    data_frames[file_name.split('.')[0]] = df
Now you can reach any data frame through the data_frames dictionary, e.g. data_frames['a'] to access the data from a.csv.
Try:
import glob
import pandas as pd

p = glob.glob('folder_path_where_csv_files_stored/*.csv')  # 1. returns a list of all csv files in this folder, no need to type them one by one
d = [pd.read_csv(i) for i in p]  # 2. creates a list of dataframes: one dataframe from each csv file
df = pd.concat(d, axis=0, ignore_index=True)  # 3. creates one dataframe `df` from the dataframes in the list `d`
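Step 3 merges everything into a single dataframe. If you do concatenate but still want to know which file each row came from, one possible sketch (reusing p and d from the snippet above, and pathlib only to strip the extension) is to pass keys to pd.concat:

from pathlib import Path

# label each block of rows with the name of the file it came from
df = pd.concat(d, keys=[Path(i).stem for i in p], names=['source_file', 'row'])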
I have a project where I take every csv file in one specific location and convert each csv into its own dataframe.
As you can see, it was not necessary to list the file names; the code simply takes every file it finds and puts it into a list. Now I want to turn that list into dataframes; the names of the dataframes can be generic. Is there any way to do this? How do I get these dataframes out of the list?
import os
import pandas as pd
from IPython.display import display
# assign path
path, dirs, files = next(os.walk("/Users/user/Documents/appfolio"))
file_count = len(files)
# create empty list
dataframes_list = []
# append datasets to the list
for i in range(file_count):
    temp_df = pd.read_csv("/Users/user/Documents/appfolio/" + files[i], encoding='windows-1252')
    dataframes_list.append(temp_df)

for dataset in dataframes_list:
    display(dataset)
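Since the question is how to get the dataframes back out of the list with usable names, one minimal sketch, assuming the same folder and that file names are unique, is to key them by file name instead of collecting them in a plain list (the lookup name at the end is hypothetical):

import os
import pandas as pd

folder = "/Users/user/Documents/appfolio"
_, _, files = next(os.walk(folder))

# build a dict mapping each csv file name to the dataframe loaded from it
dataframes_by_name = {
    name: pd.read_csv(os.path.join(folder, name), encoding='windows-1252')
    for name in files
    if name.endswith('.csv')
}

# access one dataframe by its file name, e.g.
# dataframes_by_name['some_file.csv']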
I'm working with around 30 identically structured csv files in Pandas (one-minute timeseries data, one year each, ~100 MB). Mostly, I do the same operation on each of the 30 dataframes. Is there a convenient way to apply an operation to each of the dataframes at once while keeping the files separate? Something like this?
for df in df1, df2, df3:
    df = df.dropna(subset=['A', 'B'])
    df['C'] = df['A'] / df['B']
    df_a = df[(df.C >= 50)]
The files can be manipulated or filtered as you need and then stored in a dictionary for review
Each new dataframe is a value in the dictionary
Use pathlib to find all the files
It treats file names and paths as objects with methods.
.stem gets the file name without its extension, which can be used as part of the key for the stored dataframes.
from pathlib import Path
import pandas as pd
p = Path('csv_files/Data/') # path to files
files = list(p.glob('*.csv')) # list of files
df_dict = dict()
for file in files:
    df = pd.read_csv(file)  # create dataframe from file
    df.dropna(subset=['A', 'B'], inplace=True)
    df['C'] = df['A'] / df['B']
    key = f'filtered_{file.stem}'
    df_dict[key] = df[(df.C >= 50)]  # create a new dataframe and store it in the dictionary
# get list of keys
print(df_dict.keys())
# access a dataframe
df_dict['filtered_file_name']
# save dataframes
for k, v in df_dict.items():
    v.to_csv(f'{k}.csv')
f-strings
f-Strings: A New and Improved Way to Format Strings in Python
PEP 498 - Literal String Interpolation
pathlib
pathlib, part of the standard library.
Python 3's pathlib Module: Taming the File System
I need to create a dataframe for each of the following datasets (csv files stored in a folder):
0 text1.csv
1 text2.csv
2 text3.csv
3 text4.csv
4 text5.csv
The above list is created using os.chdir and lists all the csv files included in a folder in the following path:
os.chdir("path")
To create the dataframe (to be used later on) for each of the datasets above, I am doing the following:
texts = []
for item in glob.glob("*.csv"):
    texts.append(item)

for (x, z) in enumerate(texts):
    print(x, z)
    df = pd.read_csv(datasets[int(x)])
    df.index.name = datasets[int(x)]
However, it does not create any dataframe. I think the problem is with df, as I am not distinguishing it for each dataset (I am only trying to read each dataset using pd.read_csv(datasets[int(x)])).
Could you please tell me how to create a dataframe for each of the datasets (for example df1 related to text1, df2 related to text2, and so on)?
Thank you for your help.
I'd use a function and return a list of the dataframes
Simple, one-liner function:
import glob
import pandas as pd
def get_all_csv(path, sep=','):
    # read all the csv files in a directory to a list of dataframes
    return [pd.read_csv(csv_file, sep=sep)
            for csv_file in glob.glob(path + "*.csv")]

# get all the csv in the current directory
dfs = get_all_csv('./', sep=';')
print(dfs)
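Since the question asks for a dataframe tied to each file name (df1 for text1, and so on), a small variation of the same idea, sketched here, returns a dict keyed by file name instead of a plain list (the function name is mine, not from the original answer):

import glob
import os
import pandas as pd

def get_all_csv_by_name(path, sep=','):
    # read every csv in the directory into a dict: file name (without extension) -> dataframe
    return {os.path.splitext(os.path.basename(csv_file))[0]: pd.read_csv(csv_file, sep=sep)
            for csv_file in glob.glob(path + "*.csv")}

dfs_by_name = get_all_csv_by_name('./')
# dfs_by_name['text1'] would then hold the dataframe loaded from text1.csv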
Is a list of dataframes what you are looking for?
import pandas as pd
import glob
results=[]
paths = glob.glob("*.csv")
for path in paths:
    df = pd.read_csv(path)
    results.append(df)
I have been given 5 CSV files, and now I want to combine all the data from these files into one single table.
I have tried pd.concat and .join from pandas so far, but I can only get two files combined. So far I've tried the following:
data = pd.read_csv('data.csv')
data1 = pd.read_csv('data2.csv')
merge = data.join(data1, lsuffix='_NOM', rsuffix='_NIM')
In the end, I want to have all of the data side by side in my table (see the sample data.csv).
You just loop through the directory that contains the .csv files. For example, see below:
import glob
import pandas as pd

frames = []  # collect each file's dataframe here
for filename in glob.glob('./<path to your data files>/*.csv'):
    frames.append(pd.read_csv(filename))

# DataFrame.append was removed in pandas 2.0, so combine the pieces with pd.concat
df = pd.concat(frames)
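Note that this stacks the files on top of each other as new rows. Since the question asks for the data side by side, one possible sketch is to concatenate along the columns instead, assuming the rows of the five files line up (or share an index you can align on); the path placeholder is the same as above:

import glob
import pandas as pd

frames = [pd.read_csv(filename) for filename in glob.glob('./<path to your data files>/*.csv')]

# place the files next to each other column-wise; rows are aligned by index
side_by_side = pd.concat(frames, axis=1)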