I'm working with around 30 similarly structured CSV files in pandas (one-minute time series data, one year each, ~100 MB per file). Mostly I perform the same operations on each of the 30 dataframes. Is there a convenient way to apply an operation to each of the dataframes at once while keeping them separate? Something like this?
for df in df1, df2, df3:
    df = df.dropna(subset=['A', 'B'])
    df['C'] = df['A'] / df['B']
    df_a = df[(df.C >= 50)]
The files can be manipulated or filtered as you need and then stored in a dictionary for review; each new dataframe is a value in the dictionary.
Use pathlib to find all the files
It treats file names and paths as objects with methods.
.stem gets the file name without its extension, which is used as part of the key for each stored dataframe.
from pathlib import Path
import pandas as pd
p = Path('csv_files/Data/') # path to files
files = list(p.glob('*.csv')) # list of files
df_dict = dict()
for file in files:
    df = pd.read_csv(file)  # create dataframe from file
    df.dropna(subset=['A', 'B'], inplace=True)
    df['C'] = df['A'] / df['B']
    key = f'filtered_{file.stem}'
    df_dict[key] = df[(df.C >= 50)]  # create a new dataframe and store it in the dictionary
# get list of keys
print(df_dict.keys())
# access a dataframe
df_dict['filtered_file_name']
# save dataframes
for k, v in df_dict.items():
    v.to_csv(f'{k}.csv')
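If you later need to apply another operation to every stored dataframe, you can loop over the dictionary in the same way, which keeps each result separate; a minimal sketch, where the extra column 'D' is purely hypothetical:
for key, df in df_dict.items():
    df_dict[key] = df.assign(D=df['C'] * 2)  # 'D' is a hypothetical derived column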
Resources on f-strings: f-Strings: A New and Improved Way to Format Strings in Python; PEP 498 - Literal String Interpolation.
Resources on pathlib: pathlib, part of the standard library; Python 3's pathlib Module: Taming the File System.
Related
I have a folder, and inside it suppose there are 1000 .csv files stored. Now I have to create a dataframe from 50 of these files; instead of loading them one by one, is there a faster approach available?
I also want the file name to be the name of each dataframe.
I tried the method below, but it is not working.
# List of file that I want to load out of 1000
path = "..."
file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']
for i in range(0, len(file_names)):
    col_names[i] = pd.read_csv(path + col_name[i])
But when I try to read a variable by name, it does not display any result.
Is there any way I can achieve the desired result?
I have checked various articles, but in each of them all the data ends up concatenated at the end, and I want each file to be loaded individually.
IIUC, what you want to do is create multiple dfs, and not concatenate them.
In that case, you can read each file using read_csv and stack the returned df objects in a list:
your_paths = [...]  # paths to all of your wanted csvs
l = [pd.read_csv(i) for i in your_paths] # This will give you a list of your dfs
l[0] # One of your dfs
If you want them named, you can build a dict with differently named keys instead.
You can access them individually, through index slicing or key slicing, depending on the data structure you use.
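For the dict variant just mentioned, a minimal sketch (the file names here are placeholders):
import pandas as pd

your_paths = ['a.csv', 'b.csv']  # substitute your real paths
d = {p.rsplit('.', 1)[0]: pd.read_csv(p) for p in your_paths}  # key each df by its file name
d['a']  # the dataframe read from a.csv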
I would not recommend this approach, though, as it is counter-intuitive and multiple df objects use a little more memory than a single one.
file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']
data_frames = {}
for file_name in file_names:
    df = pd.read_csv(file_name)
    data_frames[file_name.split('.')[0]] = df
Now you can reach any dataframe from the data_frames dictionary, e.g. data_frames['a'] to access a.csv.
try:
import glob
p = glob.glob('folder_path_where_csv_files_stored/*.csv')  # 1. will return a list of all csv files in this folder, no need to type them one by one
d = [pd.read_csv(i) for i in p] #2. will create a list of dataframes: one dataframe from each csv file
df = pd.concat(d, axis=0, ignore_index=True) #3. will create one dataframe `df` from those dataframes in the list `d`
I can read one ann file into a pandas dataframe as follows:
df = pd.read_csv('something/something.ann', sep='^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
df.head()
But I don't know how to read multiple ann files into one pandas dataframe. I tried to use concat, but the result is not what I expected.
How can I read many ann files into one pandas dataframe?
It sounds like you need to use glob to pull in all the .ann files from a folder and add them to a list of dataframes. After that you probably want to join/merge/concat etc. as required.
I don't know your exact requirements, but the code below should get you close. As it stands, the script assumes that, relative to where you run it, there is a subfolder called files from which you want to pull in all the .ann files (it will not look at anything else). Review and change as required; it is commented per line.
import pandas as pd
import glob
path = r'./files' # use your path
all_files = glob.glob(path + "/*.ann")
# create empty list to hold dataframes from files found
dfs = []
# for each file in the path above ending .ann
for file in all_files:
    # open the file
    df = pd.read_csv(file, sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
    # add this new (temp during the looping) frame to the end of the list
    dfs.append(df)

# at this point you have a list of frames with each list item as one .ann file. Like [annFile1, annFile2, etc.] - just not those names.

# handle a list that is empty
if len(dfs) == 0:
    print('No files found.')
    # create a dummy frame
    df = pd.DataFrame()
# or have only one item/frame and get it out
elif len(dfs) == 1:
    df = dfs[0]
# or concatenate more than one frame together
else:  # modify this join as required.
    df = pd.concat(dfs, ignore_index=True)
    df = df.reset_index(drop=True)
#check what you've got
print(df.head())
I would need to create a dataframe for each of the following datasets (csv files stored in a folder):
0 text1.csv
1 text2.csv
2 text3.csv
3 text4.csv
4 text5.csv
The above list is created after changing into the folder with os.chdir; it lists all the csv files included in that folder:
os.chdir("path")
To create the dataframe (to be used later on) for each of the datasets above, I am doing as follow:
texts = []
for item in glob.glob("*.csv"):
    texts.append(item)

for (x, z) in enumerate(texts):
    print(x, z)
    df = pd.read_csv(datasets[int(x)])
    df.index.name = datasets[int(x)]
However, it does not create any dataframe. I think the problem is in df, as I am not distinguishing it for each dataset (I am only reading each dataset using pd.read_csv(datasets[int(x)])).
Could you please tell me how to create a dataframe for each of the datasets (for example df1 related to text1, df2 related to text2, and so on)?
Thank you for your help.
I'd use a function and return a list of the dataframes
A simple one-liner function:
import glob
import pandas as pd
def get_all_csv(path, sep=','):
    # read all the csv files in a directory into a list of dataframes
    return [pd.read_csv(csv_file, sep=sep)
            for csv_file in glob.glob(path + "*.csv")]
# get all the csv in the current directory
dfs = get_all_csv('./', sep=';')
print(dfs)
Is a list of dataframes what you are looking for?
import pandas as pd
import glob
results = []
paths = glob.glob("*.csv")
for path in paths:
    df = pd.read_csv(path)
    results.append(df)
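If you want each dataframe tied to its file name (df1 related to text1 and so on), a small variation of the above keeps the dataframes in a dictionary keyed by file name; a sketch based on the listing above:
import glob
import pandas as pd

results = {}
for path in glob.glob("*.csv"):
    name = path.rsplit('.', 1)[0]      # e.g. 'text1' from 'text1.csv'
    results[name] = pd.read_csv(path)

results['text1']  # the dataframe built from text1.csv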
Hello, I am trying to get a list of file names and file creation dates from a directory and insert them into a pandas dataframe, but I am getting a type error.
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
Any help on how to do this would be great, thanks.
import os
import time
import pandas as pd
cur = os.getcwd()
folder = os.listdir(cur)
files = []
for f in folder:
    files.append(f)

creation = []
for cd in files:
    c = time.ctime(os.path.getctime(cd))
    creation.append(c)
filenames = pd.DataFrame(files, columns=['Files'])
file_creation = pd.DataFrame(creation, columns=['Date Created'])
df = pd.concat(filenames, file_creation)
The message says that the first argument of pd.concat should be a sequence of the Series or DataFrames to be concatenated. You passed one dataframe as the first argument and the other one as the second, but the second argument already means something else, in this case the axis to concatenate along.
So try
df = pd.concat([filenames, file_creation], axis=1)
However, IMO it is not the shortest way to first create two dataframes only to concatenate them afterwards. You should create the final dataframe directly from the two lists:
df = pd.DataFrame({'Files': files, 'Date Created': creation})
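Putting that together, a minimal self-contained sketch of this approach (it scans the current working directory, as in your code):
import os
import time
import pandas as pd

files = os.listdir(os.getcwd())
creation = [time.ctime(os.path.getctime(f)) for f in files]
df = pd.DataFrame({'Files': files, 'Date Created': creation})
print(df.head())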
Looks like you're better off building a generator of 2-tuples (filename and the timestamp converted to an actual datetime object), then building your DataFrame directly from that, e.g.:
import pathlib
import pandas as pd
files = (
    (file.name, pd.to_datetime(file.stat().st_ctime, unit='s'))
    for file in pathlib.Path.cwd().iterdir()
)
df = pd.DataFrame(files, columns=['Files', 'Creation Time'])
I would like to create scalable code to import multiple CSV files, standardize the order of the columns based on the column names, and re-write the CSV files.
import glob
import pandas as pd
# Get a list of all the csv files
csv_files = glob.glob('*.csv')
# List comprehension that loads all of the files
dfs = [pd.read_csv(x,delimiter=";") for x in csv_files]
A=pd.DataFrame(dfs[0])
B=pd.DataFrame(dfs[1])
alpha=A.columns.values.tolist()
print([pd.DataFrame(x[alpha]) for x in dfs])
I would like to be able to split this object and write a CSV for each of the files, renaming them with the original names. Is that easily possible with Python? Thanks for your help.
If you want to reorder columns by a consistent order, assuming that all csv's have the same column names but in a different order, you can sort one of the column name lists and then order the other ones by that list. Using your example:
csv_files = glob.glob('*.csv')
sorted_columns = []
for e, x in enumerate(csv_files):
    df = pd.read_csv(x, delimiter=";")
    if e == 0:
        sorted_columns = sorted(df.columns.values.tolist())
    df[sorted_columns].to_csv(x, sep=";")
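If you would rather not overwrite the originals, a variant that writes the reordered files into a separate folder; a sketch, where the output folder name 'reordered' is an assumption and index=False avoids adding an extra unnamed column:
import glob
import os
import pandas as pd

os.makedirs('reordered', exist_ok=True)   # hypothetical output folder
sorted_columns = []
for e, x in enumerate(glob.glob('*.csv')):
    df = pd.read_csv(x, delimiter=";")
    if e == 0:
        sorted_columns = sorted(df.columns.values.tolist())
    # write the reordered file under its original name, inside the output folder
    df[sorted_columns].to_csv(os.path.join('reordered', x), sep=";", index=False)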