I have a project to pick up every csv file in one specific location and convert each csv into a dataframe.
As you can see, it is not necessary to specify the names of the files; it simply takes every file it finds and reads it into a list. Now I want to get those dataframes out of the list; the names of the dataframes can be generic. Is there any way to do this? How do I get these dataframes out of the list?
import os
import pandas as pd
from IPython.display import display
# assign path
path, dirs, files = next(os.walk("/Users/user/Documents/appfolio"))
file_count = len(files)
# create empty list
dataframes_list = []
# append datasets to the list
for i in range(file_count):
    temp_df = pd.read_csv("/Users/user/Documents/appfolio/" + files[i], encoding='windows-1252')
    dataframes_list.append(temp_df)
for dataset in dataframes_list:
    display(dataset)
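If you want generic names for each dataframe rather than positions in a list, one minimal sketch is to key them in a dictionary (the df_0, df_1, … names here are just placeholders, not from the original code):

named_dfs = {f"df_{i}": df for i, df in enumerate(dataframes_list)}
named_dfs["df_0"]  # the dataframe built from the first file found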
I have a folder, and inside the folder suppose there are around 1000 .csv files stored. Now I have to create a dataframe based on 50 of these files, so instead of loading them one by one, is there any faster approach available?
I also want each file name to be the name of its dataframe.
I tried the method below, but it is not working.
# List of files that I want to load out of 1000
path = "..."
file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']
for i in range(0, len(file_names)):
    col_names[i] = pd.read_csv(path + col_name[i])
But when I try to read a variable by name, it does not display any result.
Is there any way I can achieve the desired result?
I have checked various articles, but in each of them all the data is concatenated at the end, and I want each file to be loaded individually.
IIUC, what you want to do is create multiple dfs, not concatenate them.
In this case you can read each file using read_csv and stack the returned df objects in a list:
your_paths = [...]  # paths to all of your wanted csvs
l = [pd.read_csv(i) for i in your_paths]  # this gives you a list of your dfs
l[0]  # one of your dfs
If you want them named, you can build a dict with differently named keys.
You can then access them individually, through index slicing or key slicing, depending on the data structure you use.
I would not recommend this approach, though: it is counterintuitive, and multiple df objects use a little more memory than a single one.
file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']
data_frames = {}
for file_name in file_names:
    df = pd.read_csv(file_name)
    data_frames[file_name.split('.')[0]] = df
Now you can reach any dataframe in the data_frames dictionary, e.g. data_frames['a'] to access a.csv.
Try:
import glob
p = glob.glob('folder_path_where_csv_files_stored/*.csv')  # 1. returns a list of all csv files in this folder, no need to type them one by one
d = [pd.read_csv(i) for i in p]  # 2. creates a list of dataframes: one dataframe from each csv file
df = pd.concat(d, axis=0, ignore_index=True)  # 3. creates one dataframe `df` from the dataframes in the list `d`
https://github.com/CSSEGISandData/COVID-19/tree/f57525e860010f6c5c0c103fd97e2e7282b480c8/csse_covid_19_data/csse_covid_19_daily_reports
In the JHU COVID-19 dataset, I want to get the death numbers for Suffolk County and add a new "death" column to the CDC vaccine dataframe (plotted in the picture). As you can see, the COVID data comes as daily records. How do I achieve this in Python?
So first you want to get a list of the files:
from glob import glob
files = glob("directory_that_contains_csvs/*.csv")  # glob already returns a list
The next step is to iterate over these files and concatenate them:
all_files = []
for file in files:
    df_file = pd.read_csv(file)
    all_files.append(df_file[df_file['column_name'] == 1])
df_all = pd.concat(all_files, axis=0, ignore_index=True)  # stack the per-file rows into one dataframe
And you should be good.
The process is that you:
1. Create a list of the files you want to load.
2. Create a list that can be used to append all the dataframes.
3. Use a for-loop to iterate over that list and append each loaded dataframe.
4. Concatenate all the dataframes in the list into a single dataframe.
I hope this helps
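Applied to the JHU daily reports from the question above, the same pattern might look like the sketch below. The 'Admin2', 'Deaths', and 'Province_State' column names come from the later daily-report format, and the folder path, state filter, and date handling are assumptions about your layout, not tested against your data:

from glob import glob
import os
import pandas as pd

rows = []
for file in glob("csse_covid_19_daily_reports/*.csv"):
    df_file = pd.read_csv(file)
    if 'Admin2' not in df_file.columns:  # early reports predate the county-level format
        continue
    # more than one state has a Suffolk County, so filter the state as well (Massachusetts is just an example)
    suffolk = df_file[(df_file['Admin2'] == 'Suffolk') & (df_file['Province_State'] == 'Massachusetts')]
    date = os.path.splitext(os.path.basename(file))[0]  # file names are MM-DD-YYYY dates
    rows.append({'date': date, 'death': suffolk['Deaths'].sum()})
deaths_by_day = pd.DataFrame(rows)
# deaths_by_day can then be merged into the vaccine dataframe on its date column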
I have a problem which might be easy for the majority of people here.
I have four folders: SA1, SA2, SA3, SA4.
Each folder has around 60 csv files. I have defined the path like this:
my_string = "{0}/folder1/{1}/{1} something/{2}/AUST"
analytics_path = "C:/Users/...../SharePoint - Documents/folder2"
year = "2016" # User should define this
level = "SA3" # User should define this
path = my_string.format(analytics_path, year, level)
Once the user defines the year and level in the path above, I want to combine all the csv files under the "level" folder based on the "index_col=" parameter.
For example, for SA1 I want to combine the CSV files based on the "SA1_code" column. For SA2, I want to combine the CSV files based on
the "SA2_MAIN_DIGIT_CODE" column. For SA3 and SA4, the index_col should be "SA3_MULTI" and "SA4_REGIONS" respectively. As you can see, the column names for the CSV files under these four folders are all different.
So far, I have attempted the following things.
I have defined the function as
def combine_csv(path):
    """
    Concatenates the csv files and creates one huge dataframe combining the information from all the csv files in a given folder

    Parameters
    ----------
    path (string): location of the folder and the files therein

    Returns
    ----------
    A dataframe of all the concatenated csv files
    """
    # setting the path for joining multiple files
    files = os.path.join(path, "*.csv")
    # list of matching files returned
    files = glob.glob(files)
    # joining files with concat and read_csv
    list_csv = []
    for filename in files:
        list_df = pd.read_csv(filename)  # can't pass "index_col" here, as the csv files in each folder use four different column names
        list_csv.append(list_df)
    df = pd.concat(list_csv, axis=1, ignore_index=False)
    return df
data_df = combine_csv(path)
gives me the combined dataframe. But I want to combine based on "SA1_code" if the user chooses the SA1 folder, or "SA2_MAIN_DIGIT_CODE" if they choose to combine CSV files from the SA2 folder, and so on and so forth.
How do I do this?
You don't have four separate index columns, you just have one that changes depending on user input. Therefore, the solution to your problem is relatively simple. First, modify your combine_csv method:
def combine_csv(path, index):
    """
    Concatenates the csv files and creates one huge dataframe combining the information from all the csv files in a given folder

    Parameters
    ----------
    path (string): location of the folder and the files therein
    index (string): name of the column to use as the index when reading each csv

    Returns
    ----------
    A dataframe of all the concatenated csv files
    """
    # setting the path for joining multiple files
    files = os.path.join(path, "*.csv")
    # list of matching files returned
    files = glob.glob(files)
    # joining files with concat and read_csv
    list_csv = []
    for filename in files:
        list_df = pd.read_csv(filename, index_col=index)
        list_csv.append(list_df)
    df = pd.concat(list_csv, axis=1, ignore_index=False)
    return df
All I did was inject a value, index, that will be used for the index_col argument to read_csv.
Next, we need to determine the value for index based on the value for level, as input by the user. According to your question, it seems that there should be a one-to-one relationship between these values. So, we can use a dictionary for this:
LevelIndexMapping = {
    "SA1": "SA1_code",
    "SA2": "SA2_MAIN_DIGIT_CODE",
    "SA3": "SA3_MULTI",
    "SA4": "SA4_REGIONS"
}
my_string = "{0}/folder1/{1}/{1} something/{2}/AUST"
analytics_path = "C:/Users/...../SharePoint - Documents/folder2"
year = "2016" # User should define this
level = "SA3" # User should define this
path = my_string.format(analytics_path, year, level)
combine_csv(path, LevelIndexMapping[level])
Here, I created a dictionary that maps your level variable to its associated index column value, and then accesses that mapping when calling combine_csv.
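A small, hypothetical extension (not part of the original answer): if the user can type an arbitrary level, you may want to fail fast on a value that has no mapping before touching the file system:

if level not in LevelIndexMapping:
    raise ValueError(f"Unknown level {level!r}; expected one of {sorted(LevelIndexMapping)}")
combine_csv(path, LevelIndexMapping[level])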
I need to create a dataframe for each of the following datasets (csv files stored in a folder):
0 text1.csv
1 text2.csv
2 text3.csv
3 text4.csv
4 text5.csv
The above list contains all the csv files included in a folder, created after changing into that path with:
os.chdir("path")
To create the dataframe (to be used later on) for each of the datasets above, I am doing as follows:
texts = []
for item in glob.glob("*.csv"):
    texts.append(item)

for (x, z) in enumerate(texts):
    print(x, z)
    df = pd.read_csv(datasets[int(x)])
    df.index.name = datasets[int(x)]
However, it does not create any dataframe. I think the problem is in df, as I am not distinguishing it for each dataset (I am only reading each dataset using pd.read_csv(datasets[int(x)])).
Could you please tell me how to create a dataframe for each of the datasets (for example, df1 related to text1, df2 related to text2, and so on)?
Thank you for your help.
I'd use a function and return a list of the dataframes
Simple, one-liner function:
import glob
import pandas as pd
def get_all_csv(path, sep=','):
    # read all the csv files in a directory into a list of dataframes
    return [pd.read_csv(csv_file, sep=sep)
            for csv_file in glob.glob(path + "*.csv")]
# get all the csv in the current directory
dfs = get_all_csv('./', sep=';')
print(dfs)
Is a list of dataframes what you are looking for?
import pandas as pd
import glob
results = []
paths = glob.glob("*.csv")
for path in paths:
    df = pd.read_csv(path)
    results.append(df)
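If you specifically want each dataframe reachable by its file name (the df1/text1 pairing asked about above), a minimal sketch is a dict keyed by the file stem; the 'text1' key assumes the file names from the question:

import glob
import os
import pandas as pd

# map each file's base name (e.g. 'text1') to its dataframe
dfs_by_name = {os.path.splitext(os.path.basename(p))[0]: pd.read_csv(p)
               for p in glob.glob("*.csv")}
dfs_by_name['text1']  # the dataframe loaded from text1.csv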
I would like to write scalable code that imports multiple CSV files, standardizes the order of the columns based on the column names, and re-writes the CSV files.
import glob
import pandas as pd
# Get a list of all the csv files
csv_files = glob.glob('*.csv')
# List comprehension that loads of all the files
dfs = [pd.read_csv(x,delimiter=";") for x in csv_files]
A=pd.DataFrame(dfs[0])
B=pd.DataFrame(dfs[1])
alpha=A.columns.values.tolist()
print([pd.DataFrame(x[alpha]) for x in dfs])
I would like to be able to split this object, write a CSV for each of the files, and rename them with the original names. Is that easily possible with Python? Thanks for your help.
If you want to reorder columns into a consistent order, assuming that all csv's have the same column names but in a different order, you can sort one file's column-name list and then order all the others by it. Using your example:
csv_files = glob.glob('*.csv')
sorted_columns = []
for e, x in enumerate(csv_files):
    df = pd.read_csv(x, delimiter=";")
    if e == 0:
        sorted_columns = sorted(df.columns.values.tolist())
    df[sorted_columns].to_csv(x, sep=";")
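One caveat, as an addition to the answer above: to_csv writes the row index by default, so re-writing the files this way adds an unnamed index column on the next read. If that is unwanted, pass index=False:

df[sorted_columns].to_csv(x, sep=";", index=False)  # don't write the row index back out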