I have 3000 CSV files for machine learning and need to process each file separately, but the code I apply is the same. File sizes vary between 16 KB and 25 MB, line counts between 60 and 330 thousand, and each CSV file has 77 columns. With the help of the previous post I wrote the code inside the loop, but after applying it I cannot update the files in place. I just applied the code from the previous post
and got the error "No such file or directory: '101510EF'" (101510EF is the first CSV file in my folder).
Looking forward to your help. Thank you!
You don't need the line:
file_name=os.path.splitext(...)
Just this:
path = "absolute/path/to/your/folder"
os.chdir(path)
all_files = glob.glob('*.csv')
for file in all_files:
df = pd.read_csv(file)
df["new_column"] = df["seq"] + df["log_id"]
df.to_csv(file)
You need to provide the absolute path, including the file extension, to the pd.read_csv and df.to_csv methods,
e.g. c:/Users/kaanarik/Desktop/tez_deneme/ornel/101510EF.csv
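If you prefer not to change the working directory, a minimal alternative sketch is to build the full path for each file yourself (the folder below is just the path mentioned above, and the seq/log_id columns are taken from the answer):

import glob
import os
import pandas as pd

folder = "c:/Users/kaanarik/Desktop/tez_deneme/ornel"  # folder from the question

for full_path in glob.glob(os.path.join(folder, "*.csv")):
    df = pd.read_csv(full_path)
    df["new_column"] = df["seq"] + df["log_id"]  # same transformation as in the answer above
    df.to_csv(full_path, index=False)  # overwrite each file in place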
I am trying to optimize a function that imports .txt files for data analysis. I have a somewhat large corpus, but given that the function only reads the documents and builds a DataFrame with each paragraph as an element, I think it is taking WAY too long: around 30 minutes for 17 documents of about 1000 pages each.
Any suggestions on how to make this faster? I only have to load it once, but it's annoying to lose half an hour loading the data.
import os
import pandas as pd

def read_docs_paragraph(textfolder):
    """
    This function reads all the files in a folder and returns a dataframe with the content of the files
    chunked by paragraphs (if the .txt file is organized by paragraphs) and the name of the file.

    Parameters
    ----------
    textfolder : str
        The path of the folder where the files are located.

    Returns
    -------
    df : DataFrame
        A dataframe with the content of the files and the name of the file.
    """
    df = pd.DataFrame()
    df['Corpus'] = ''
    df['Estado'] = ''
    # Iterate over every .txt file in the folder
    for filename in os.listdir(textfolder):
        if filename.endswith('.txt'):
            # join the folder and the file name so this also works when textfolder is not the current directory
            with open(os.path.join(textfolder, filename), 'r', encoding='utf8') as f:
                for line in f.readlines():
                    df_length = len(df)
                    df.loc[df_length] = line                  # append the paragraph as a new row
                    df.loc[df_length, 'Estado'] = filename    # record which file it came from
    return df
Here's a suggestion with Path and fileinput, both from the standard library:
from pathlib import Path
import fileinput

import pandas as pd

def read_docs_paragraph(textfolder):
    with fileinput.input(Path(textfolder).glob("*.txt")) as files:
        return pd.DataFrame(
            ([line, files.filename()] for line in files),
            columns=["Corpus", "Estado"]
        )
I've timed it a bit and it seems to be about 700 times faster (though that may depend on the files, machine, etc.).
As pointed out by @FranciscoMelloCastro: if you have to be explicit about the encoding, you can use openhook=fileinput.hook_encoded("utf-8") or, starting with Python 3.10, encoding="utf-8".
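For reference, a minimal sketch of how the 3.10+ keyword would slot into the function above (on older versions you would pass the openhook instead):

from pathlib import Path
import fileinput

import pandas as pd

def read_docs_paragraph(textfolder):
    # encoding="utf-8" requires Python 3.10+; earlier versions would use
    # openhook=fileinput.hook_encoded("utf-8")
    with fileinput.input(Path(textfolder).glob("*.txt"), encoding="utf-8") as files:
        return pd.DataFrame(
            ([line, files.filename()] for line in files),
            columns=["Corpus", "Estado"]
        )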
I have some Python (3.8) code that does the following:
Walks directory and subdirectories of a given path
Finds all .csv files
Finds all .csv files with 'Pct' in filename
Joins path and file
Reads CSV
Adds filename to df
Concatenates all dfs together
The code below works, but takes a long time (15 minutes) to ingest all the CSVs - there are 52,000 files. This might not in fact be a long time, but I want to reduce it as much as possible.
My current working code is below:
import fnmatch
import os
import pandas as pd

start_directory = '/home/ubuntu/Desktop/noise_paper/part_2/Noise/Data/'  # change this

df_result = None

for path, dirs, files in os.walk(start_directory):
    for file in sorted(fnmatch.filter(files, '*.csv')):  # find .csv files
        if 'Pct' in file:  # filter if the file name contains 'Pct'
            full_name = os.path.join(path, file)  # make full file path
            df_tmp = pd.read_csv(full_name, header=None)  # read file to df_tmp
            df_tmp['file'] = os.path.basename(file)  # df.file = file name
            if df_result is None:
                df_result = df_tmp
            else:
                df_result = pd.concat([df_result, df_tmp], axis='index', ignore_index=True)
Inspired by this post (glob to find files recursively) and this post (how to speed up importing csvs), I have tried to reduce the time it takes to ingest all the data, but I can't figure out how to integrate a filter for only the filenames that contain 'Pct' and then add the filename to the df. This might not be possible with the code from these examples.
What I have tried below (incomplete):
%%time
import glob
import pandas as pd

df = pd.concat(
    [pd.read_csv(f, header=None)
     for f in glob.glob('/home/ubuntu/Desktop/noise_paper/part_2/Noise/Data/**/*.csv', recursive=True)],
    axis='index', ignore_index=True
)
Question
Is there any way that I can reduce the time it takes to read and ingest the CSVs in my code above?
Thanks!
Check out the following solution. It assumes the open-file system limit is high enough, because it streams the files one by one but has to open each of them to read the headers. In cases where files have different columns, you will get the superset of them in the resulting file:
import os

from convtools import conversion as c
from convtools.contrib.tables import Table

# start_directory is the root folder defined in the question
files = sorted(
    os.path.join(path, file)
    for path, dirs, files in os.walk(start_directory)
    for file in files
    if "Pct" in file and file.endswith(".csv")
)

table = None
for file in files:
    table_ = Table.from_csv(file, header=True)  # assuming there's a header
    if table is None:
        table = table_
    else:
        table.chain(table_)

# this will be an iterable of dicts, so consume with pandas or whatever
table.into_iter_rows(dict)  # or list, or tuple
# or just write the new file like:
# >>> table.into_csv("concatenated.csv")
# HOWEVER: into_* can only be used once, because Table
# cannot assume the incoming data stream can be read twice
If you are sure that all the files have the same columns (only one file is opened at a time), edited to add the file column:
def concat_files(files):
    for file in files:
        yield from Table.from_csv(file, header=True).update(
            file=file
        ).into_iter_rows(dict)

# this will be an iterable of dicts, so consume with pandas or whatever
concat_files(files)
P.S. of course you can replace Table.from_csv with a standard/other reader, but this one adapts to the file, so it is generally faster on large files.
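For instance, a minimal sketch of feeding the streamed rows into pandas (it assumes the files list and the concat_files generator defined above) might look like:

import pandas as pd

# `files` and `concat_files` are assumed to be defined as in the answer above
df_result = pd.DataFrame(concat_files(files))
df_result.to_csv("concatenated.csv", index=False)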
I'm trying to import multiple csv files in one folder into one data frame. This is my code. It can iterate through the files and print them successfully, and it can read one file into a data frame, but combining them raises an error. I saw many similar questions, but the responses are complex, and I thought the 'pythonic' way is to be simple, because I am new to this. Thanks in advance for any help. The error message is always: No such file or directory: 'some file name', which makes no sense because the file name was printed successfully in the print step.
import os
import pandas as pd

# this works
df = pd.read_csv("headlines/2017-1.csv")
print(df)

path = 'C:/.../... /.../headlines/'  # <-- full path, shortened here
files = os.listdir(path)
print(files)  # <-- prints all file names successfully

for filename in files:
    print(filename)  # <-- successfully prints all file names
    df = pd.read_csv(filename)  # <-- error here
    df2.append(df)  # append to data frame
It seems like your current working directory is different from your path. Please use
os.chdir(path) before attempting to read your csv.
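A minimal sketch of that fix, together with the os.path.join alternative that avoids changing the working directory (the folder path is a placeholder for the full path in the question):

import os
import pandas as pd

path = 'C:/path/to/headlines/'  # placeholder for the full folder path

dfs = []
for filename in os.listdir(path):
    # os.listdir returns bare file names, so join each one with the folder path
    df = pd.read_csv(os.path.join(path, filename))
    dfs.append(df)

combined = pd.concat(dfs, ignore_index=True)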
I'm new to Python. I have tried to apply this code to merge multiple csv files, but it doesn't work. Basically, I have files which contain stock prices with the header: date, open, High, low, Close, Adj Close, Volume. Each csv file has a different name: Apl.csv, VIX.csv, FCHI.csv, etc.
I would like to merge all these csv files into one, but I would also like to add a new column which discloses the name of the csv file, for example:
stock_id, date, open, High, low, Close, Adj Close, Volume, with stock_id = apl, Vix, etc.
I used this code but I got stuck on line 4.
Here is the code:
files = os.listdir()
file_list = list()
for file in os.listdir():
if file.endswith(".csv")
df=pd.read_csv(file,sep=";")
df['filename'] = file
file_list.append(df)
all_days = pd.concat(file_list, axis=0, ignore_index=True)
all_days.to_csv("all.csv")
Could someone help me sort this out?
In Python, the indentation level matters, and you need a colon at the end of an if statement. I can't speak to the method you're trying, but you can clean up the syntax with this:
import os
import pandas as pd

files = os.listdir()
file_list = list()

for file in os.listdir():
    if file.endswith(".csv"):
        df = pd.read_csv(file, sep=";")
        df['filename'] = file
        file_list.append(df)

all_days = pd.concat(file_list, axis=0, ignore_index=True)
all_days.to_csv("all.csv")
I'm relatively new to Python. Here is what I'd like to do: I have a folder with multiple csv files (2018.csv, 2017.csv, 2016.csv, etc.), 500 csv files to be precise. Each csv contains the header "date", "Code", "Cur", "Price", etc. I'd like to concatenate all 500 csv files into one dataframe. Here is my code for one csv file, but it's very slow; I want to do it for all 500 files and concatenate them into one dataframe:
import pandas as pd

DB_2017 = pd.read_csv("C:/folder/2018.dat", sep=",", header=None).iloc[:, [0, 4, 5, 6]]
DB_2017.columns = ["date", "Code", "Cur", "Price"]
DB_2017['Code'] = DB_2017['Code'].map(lambda x: x.lstrip('#').rstrip('#'))
DB_2017['Cur'] = DB_2017['Cur'].map(lambda x: x.lstrip('#').rstrip('#'))
DB_2017['date'] = DB_2017['date'].apply(lambda x: pd.Timestamp(str(x)[:10]))
DB_2017['Price'] = pd.to_numeric(DB_2017.Price.replace(',', ';'))
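A minimal sketch of one way to extend this to all the files, assuming they all share the layout handled above (the folder path and glob pattern are placeholders):

import glob
import pandas as pd

frames = []
for file in sorted(glob.glob("C:/folder/*.csv")):  # placeholder folder; adjust the pattern to match the 500 files
    df = pd.read_csv(file, sep=",", header=None).iloc[:, [0, 4, 5, 6]]
    df.columns = ["date", "Code", "Cur", "Price"]
    frames.append(df)

all_years = pd.concat(frames, ignore_index=True)  # one dataframe holding all files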
I am currently learning how to work with Python and for the moment I am very fond of working with CSV files. I managed to learn a few things and now I want to apply what I learned to multiple files at once. But something got me confused. I have this code:
import os
import pandas as pd

for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(".csv"):
            paths = os.path.join(root, file)
            tables = pd.read_csv(paths, header='infer', sep=',')
            print(paths)
            print(tables)
It prints all the CSV files found in that folder in a certain format (a kind of table with the first row being a header and the rest following under it).
The trick is that I want to be able to access these anytime (print and edit them), but what I wrote there only prints them ONCE. If I write print(paths) or print(tables) anywhere else after that, it only prints the LAST CSV file and its data, even though I believe it should do the same thing.
I also tried making similar separate pieces of code for each print (tables and paths), but it only works for the first os.walk() - I just don't get why it only works once.
Thank you!
You will want to store the DataFrames as you load them. Right now you are just loading and discarding.
dfs = []
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(".csv"):
            paths = os.path.join(root, file)
            tables = pd.read_csv(paths, header='infer', sep=',')
            dfs.append(tables)
            print(paths)
            print(tables)
The above will give you a list of DataFrames dfs that you can then access and utilize. Like so:
print(dfs[0])  # prints the first DataFrame you read in

for df in dfs:
    print(df)  # prints each DataFrame in sequence
Once you have the data stored you can do pretty much anything.
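For example, if the goal is eventually a single table, a minimal sketch (assuming the dfs list collected above) of combining them would be:

import pandas as pd

# `dfs` is assumed to be the list of DataFrames collected in the loop above
combined = pd.concat(dfs, ignore_index=True)
print(combined.shape)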