Optimizing a .txt extraction function - Python

I am trying to optimize a function that imports .txt files for data analysis. I have a somewhat large corpus, but given that the function only reads the documents and builds a DataFrame with each paragraph as an element, I think it is taking WAY too long: around 30 minutes for 17 documents of about 1000 pages each.
Any suggestions on how to make this faster? I only have to load it once, but it's annoying to lose half an hour just loading the data.
def read_docs_paragraph(textfolder):
    """
    This function reads all the files in a folder and returns a dataframe with the content
    of the files chunked by paragraphs (if the .txt file is organized by paragraphs) and the
    name of the file.

    Parameters
    ----------
    textfolder : str
        The path of the folder where the files are located.

    Returns
    -------
    df : DataFrame
        A dataframe with the content of the files and the name of the file.
    """
    df = pd.DataFrame()
    df['Corpus'] = ''
    df['Estado'] = ''
    # Iterate over the folder
    for filename in os.listdir(textfolder):
        # Open the file you specified
        if filename.endswith('.txt'):
            with open(filename, 'r', encoding='utf8') as f:
                for line in f.readlines():
                    df_length = len(df)
                    df.loc[df_length] = line
                    df['Estado'].loc[df_length] = filename
    return df

Here's a suggestion with Path and fileinput, both from the standard library:
import pandas as pd
from pathlib import Path
import fileinput

def read_docs_paragraph(textfolder):
    with fileinput.input(Path(textfolder).glob("*.txt")) as files:
        return pd.DataFrame(
            ([line, files.filename()] for line in files),
            columns=["Corpus", "Estado"]
        )
I've timed it a bit and it seems to be about 700 times faster (though that will depend on the files, machine, etc.).
As pointed out by @FranciscoMelloCastro: if you have to be explicit about the encoding you can pass openhook=fileinput.hook_encoded("utf-8"), or, starting with Python 3.10, encoding="utf-8".
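For example, a minimal sketch of both encoding variants ("corpus" is a placeholder folder name; which keyword you use depends on your Python version):

import fileinput
from pathlib import Path

txt_files = list(Path("corpus").glob("*.txt"))

# Python 3.10+: fileinput.input accepts an encoding keyword directly
with fileinput.input(txt_files, encoding="utf-8") as files:
    for line in files:
        pass  # process each line here

# Older Python versions: pass the encoding through an openhook instead
with fileinput.input(txt_files, openhook=fileinput.hook_encoded("utf-8")) as files:
    for line in files:
        pass  # process each line here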

Related

Python script to convert multiple txt files into single one

I'm quite new to Python and have encountered a problem: I want to write a script that starts in a base directory containing several folders, which all have the same subdirectory structure and are numbered with a control variable (scan00, scan01, ...).
I read out the names of the folders in the directory and store them in a variable called foldernames.
Then the script should go into a subdirectory of each of these folders, where multiple txt files are stored. I store them in the variable called "myFiles".
These txt files consist of 3 columns of float values separated by tabs, and each of the txt files has 3371 rows (they are all identical in terms of rows and columns).
Now my issue: I want the script to copy only the third column of each txt file and put it into a new txt or csv file. The only exception is the first txt file, where it is important that all three columns are copied to the new file.
For the other files, the third column of each txt file should be copied into an adjacent column of the new txt/csv file.
So I would like to end up with x columns in the generated txt/csv file, where x is the number of original txt files. If possible, I would like to write the corresponding file names in the first line of the new txt/csv file (here defined as column_names).
At the end, each folder should contain a txt/csv file that combines all of the single (297) txt files.
import os
import glob

foldernames1 = []
for foldernames in os.listdir("W:/certaindirectory/"):
    if foldernames.startswith("scan"):
        # print(foldernames)
        foldernames1.append(foldernames)

for i in range(1, len(foldernames1)):
    workingpath = "W:/certaindirectory/" + foldernames1[i] + "/.../"
    os.chdir(workingpath)
    myFiles = glob.glob('*.txt')
    column_names = ['X', 'Y'] + myFiles[1:len(myFiles)]

    files = [open(f) for f in glob.glob('*.txt')]
    fout = open("ResultsCombined.txt", 'w')
    for row in range(1, 3371):  # len(files)):
        for f in files:
            fout.write(f.readline().strip().split('\t')[2])
            fout.write('\t')
        fout.write('\t')
    fout.close()
As an alternative I also tried to do it via a csv file, but I wasn't able to solve my problem:
import os
import glob
import csv

foldernames1 = []
for foldernames in os.listdir("W:/certain directory/"):
    if foldernames.startswith("scan"):
        # print(foldernames)
        foldernames1.append(foldernames)

for i in range(1, len(foldernames1)):
    workingpath = "W:/certain directory/" + foldernames1[i] + "/.../"
    os.chdir(workingpath)
    myFiles = glob.glob('*.txt')
    column_names = ['X', 'Y'] + myFiles[0:len(myFiles)]
    # print(column_names)
    with open("" + foldernames1[i] + ".csv", 'w', newline='') as target:
        writer = csv.DictWriter(target, fieldnames=column_names)
        writer.writeheader()  # if you want a header
        for path in glob.glob('*.txt'):
            with open(path, newline='') as source:
                reader = csv.DictReader(source, delimiter='\t', fieldnames=column_names)
                writer.writerows(reader)
Can anyone help me? Both codes do not deliver what I want. They read out something, but not the values I am interested in. I also have the feeling my code has some issues with float numbers.
Many thanks and best regards,
quester
pathlib and pandas should make the solution here relatively simple even without knowing the specific file names:
import pandas as pd
from pathlib import Path

p = Path("W:/certain directory/")

# recursively search for .txt files inside all sub directories
txt_files = [txt_file for txt_file in p.rglob("*.txt")]  # use p.glob("*.txt") instead for non-recursive iteration

df = pd.DataFrame()
for path in txt_files:
    # use tab separator, read only the 3rd column, name the column after the file, read as floats
    current = pd.read_csv(path,
                          sep="\t",
                          usecols=[2],
                          names=[path.name],
                          dtype="float64")
    # add header=0 to pd.read_csv if there's a header row in the .txt files
    df = pd.concat([df, current], axis=1)  # assign the result back, otherwise df stays empty

df.to_csv("W:/certain directory/floats_third_column.csv", index=False)
Hope this helps!
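One requirement from the question that the snippet above doesn't cover is that the first file should contribute all three columns rather than only the third. A minimal sketch of that variant, under the same assumptions (tab-separated, no header row, three float columns per file; the 'X' and 'Y' labels are taken from the question's own code):

import pandas as pd
from pathlib import Path

p = Path("W:/certain directory/")
txt_files = sorted(p.rglob("*.txt"))  # sorted, so "the first file" is well defined

parts = []
for idx, path in enumerate(txt_files):
    data = pd.read_csv(path, sep="\t", header=None, dtype="float64")
    if idx == 0:
        # first file: keep all three columns, labelled X, Y and the file name
        data.columns = ["X", "Y", path.name]
        parts.append(data)
    else:
        # remaining files: keep only the third column, labelled with the file name
        parts.append(data[[2]].rename(columns={2: path.name}))

result = pd.concat(parts, axis=1)
result.to_csv(p / "floats_third_column.csv", index=False)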

How can I speed up importing many CSVs while filtering and adding filenames?

I have some Python (3.8) code that does the following:
Walks directory and subdirectories of a given path
Finds all .csv files
Finds all .csv files with 'Pct' in filename
Joins path and file
Reads CSV
Adds filename to df
Concatenates all dfs together
The code below works, but takes a long time (15 minutes) to ingest all the CSVs - there are 52,000 files. This might not in fact be a long time, but I want to reduce it as much as possible.
My current working code is below:
start_dirctory = '/home/ubuntu/Desktop/noise_paper/part_2/Noise/Data/'  # change this

df_result = None
#loop_number = 0

for path, dirs, files in os.walk(start_dirctory):
    for file in sorted(fnmatch.filter(files, '*.csv')):  # find .csv files
        # print(file)
        if 'Pct' in file:  # filter if contains 'Pct'
            # print('Pct = ', file)
            full_name = os.path.join(path, file)  # make full file path
            df_tmp = pd.read_csv(full_name, header=None)  # read file to df_tmp
            df_tmp['file'] = os.path.basename(file)  # df.file = file name
            if df_result is None:
                df_result = df_tmp
            else:
                df_result = pd.concat([df_result, df_tmp], axis='index', ignore_index=True)
            #print(full_name, 'imported')
        #loop_number = loop_number + 1
        #print('Loop number =', loop_number)
Inspired by this post (glob to find files recursively) and this post (how to speed up importing csvs), I have tried to reduce the time it takes to ingest all the data, but can't figure out a way to integrate a filter for only filenames that contain 'Pct' and then add the filename to the df. This might not be possible with the code from these examples.
What I have tried below (incomplete):
%%time
import glob
import pandas as pd

df = pd.concat(
    [pd.read_csv(f, header=None)
     for f in glob.glob('/home/ubuntu/Desktop/noise_paper/part_2/Noise/Data/**/*.csv', recursive=True)],
    axis='index', ignore_index=True
)
Question
Is there any way that I can reduce the time it takes to read and ingest the CSVs in my code above?
Thanks!
Check out the following solution. It assumes the open-file limit is high enough, because it will stream every file one by one, but it has to open each of them to read the headers. In cases where files have different columns, you will get the superset of them in the resulting file:
import os

from convtools import conversion as c
from convtools.contrib.tables import Table

# start_dirctory as defined in the question's code above
files = sorted(
    os.path.join(path, file)
    for path, dirs, files in os.walk(start_dirctory)
    for file in files
    if "Pct" in file and file.endswith(".csv")
)

table = None
for file in files:
    table_ = Table.from_csv(file, header=True)  # assuming there's a header
    if table is None:
        table = table_
    else:
        table.chain(table_)

# this will be an iterable of dicts, so consume with pandas or whatever
table.into_iter_rows(dict)  # or list, or tuple

# or just write the new file like:
# >>> table.into_csv("concatenated.csv")
# HOWEVER: into_* methods can only be used once, because Table
# cannot assume the incoming data stream can be read twice
If you are sure that all the files have the same columns (only one file is opened at a time):
Edited to add the file column:
def concat_files(files):
    for file in files:
        yield from Table.from_csv(file, header=True).update(
            file=file
        ).into_iter_rows(dict)

# this will be an iterable of dicts, so consume with pandas or whatever
concat_files(files)
P.S. of course you can replace Table.from_csv with a standard/other reader, but this one adapts to the file, so it is generally faster on large files.
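If you would rather stay with pandas, as in the question's own attempt, the 'Pct' filter and the filename column can be folded into the comprehension. A minimal sketch, assuming the same directory layout as in the question:

import fnmatch
import os

import pandas as pd

start_dirctory = '/home/ubuntu/Desktop/noise_paper/part_2/Noise/Data/'  # change this

# walk the tree once, keeping only .csv files whose name contains 'Pct'
files = sorted(
    os.path.join(path, file)
    for path, dirs, files in os.walk(start_dirctory)
    for file in fnmatch.filter(files, '*.csv')
    if 'Pct' in file
)

# read each file, tag it with its base name, and concatenate once at the end
df_result = pd.concat(
    [pd.read_csv(f, header=None).assign(file=os.path.basename(f)) for f in files],
    axis='index', ignore_index=True
)

Concatenating once at the end avoids the repeated pd.concat calls in the original loop, which copy the growing dataframe on every iteration.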

How to open multiple csv files from a folder in Python?

I want to open multiple csv files in Python, collate them and have Python create a new file with the data from the multiple files reorganised...
Is there a way for me to take all the files from a single directory on my desktop and read them in Python like this?
Thanks a lot
If you have a directory containing your csv files, and they all have the extension .csv, then you could use, for example, glob and pandas to read them all in and concatenate them into one csv file. For example, say you have a directory like this:
csvfiles/one.csv
csvfiles/two.csv
where one.csv contains:
name,age
Keith,23
Jane,25
and two.csv contains:
name,age
Kylie,35
Jake,42
Then you could do the following in Python (you will need to install pandas with, e.g., pip install pandas):
import glob
import os
import pandas as pd

# the path to your csv file directory
mycsvdir = 'csvdir'

# get all the csv files in that directory (assuming they have the extension .csv)
csvfiles = glob.glob(os.path.join(mycsvdir, '*.csv'))

# loop through the files and read them in with pandas
dataframes = []  # a list to hold all the individual pandas DataFrames
for csvfile in csvfiles:
    df = pd.read_csv(csvfile)
    dataframes.append(df)

# concatenate them all together
result = pd.concat(dataframes, ignore_index=True)

# print out to a new csv file
result.to_csv('all.csv')
Note that the output csv file will have an additional column at the front containing the index of the row. To avoid this you could instead use:
result.to_csv('all.csv', index=False)
You can see the documentation for the to_csv() method here.
Hope that helps.
Here is a very simple way to do what you want to do.
import pandas as pd
import glob, os

os.chdir("C:\\your_path\\")
results = pd.DataFrame([])

for counter, file in enumerate(glob.glob("1*")):
    namedf = pd.read_csv(file, skiprows=0, usecols=[1, 2, 3])
    results = results.append(namedf)

results.to_csv('C:\\your_path\\combinedfile.csv')
Notice this part: glob("1*")
This will look only for files that start with '1' in the name (1, 10, 100, etc). If you want everything, change it to this: glob("*")
Sometimes it's necessary to merge all CSV files into a single CSV file, and sometimes you just want to merge some files that match a certain naming convention. It's nice to have this feature!
I know that the post is a little bit old, but using glob can be quite expensive in terms of memory if you are trying to read large csv files, because you store all that data in a list and then still need enough memory to concatenate the dataframes in that list into one dataframe with all the data. Sometimes this is not possible.
import pandas as pd

dir = 'directory path'
df = pd.DataFrame()
for i in range(0, 24):
    csvfile = pd.read_csv(dir + '/file name{}.csv'.format(i), encoding='utf8')
    df = df.append(csvfile)
    del csvfile
So, in case your csv files all have the same name except for some kind of number or string that differentiates them, you can just do a for loop through the files and delete each one after it has been stored in the dataframe variable using DataFrame.append! In this case all my csv files have the same name except that they are numbered in a range that goes from 0 to 23.
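Note that DataFrame.append, used in the two loops above, was deprecated in pandas 1.4 and removed in pandas 2.0, so on current pandas the same idea is usually written with pd.concat. A minimal sketch (the directory and file-name pattern are placeholders, as above):

import pandas as pd

dir = 'directory path'
frames = []
for i in range(0, 24):
    # read each numbered file and collect it in a list;
    # concatenating once at the end avoids copying the growing dataframe on every iteration
    frames.append(pd.read_csv('{}/file name{}.csv'.format(dir, i), encoding='utf8'))

df = pd.concat(frames, ignore_index=True)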

How to read multiple csv files in a directory through the Python csv module?

In one of my directories, I have multiple CSV files. I wanted to read the content of all the CSV files through Python code and print the data, but so far I have not been able to do so.
All the CSV files have the same number of columns and the same column names as well.
I know a way to list all the CSV files in the directory and iterate over them through the "os" module and a "for" loop.
for files in os.listdir("C:\\Users\\AmiteshSahay\\Desktop\\test_csv"):
Now I use the "csv" module to read the file names:
reader = csv.reader(files)
Up to here I expect the output to be the names of the CSV files, which happen to be sorted; for example, the names are 1.csv, 2.csv and so on. But the output is as below:
<_csv.reader object at 0x0000019F97E0E730>
<_csv.reader object at 0x0000019F97E0E528>
<_csv.reader object at 0x0000019F97E0E730>
<_csv.reader object at 0x0000019F97E0E528>
<_csv.reader object at 0x0000019F97E0E730>
<_csv.reader object at 0x0000019F97E0E528>
If I add the next() function after csv.reader(), I get the output below:
['1']
['2']
['3']
['4']
['5']
['6']
These happen to be the initial characters of my CSV file names, which is partially correct but not fully what I want.
Apart from this, once I have iterated over the files, how do I see the contents of the CSV files on the screen? Today I have 6 files. Later on, I could have 100 files. So it's not possible to use the file handling method in my scenario.
Any suggestions?
The easiest way I found while developing my project is by using DataFrame, read_csv, and glob.
import glob
import os
import pandas as pd
folder_name = 'train_dataset'
file_type = 'csv'
seperator =','
dataframe = pd.concat([pd.read_csv(f, sep=seperator) for f in glob.glob(folder_name + "/*."+file_type)],ignore_index=True)
Here, all the csv files are loaded into 1 big dataframe.
I would recommend reading your CSVs using the pandas library.
Check this answer here: Import multiple csv files into pandas and concatenate into one DataFrame
Although you asked for python in general, pandas does a great job at data I/O and would help you here in my opinion.
Up to here I expect the output to be the names of the CSV files
This is the problem. csv.reader objects do not represent filenames; they are lazy objects which can be iterated to yield rows from an already opened CSV file. If you wish to print an entire CSV file, open it first and then call list on the csv.reader object:
folder = "C:\\Users\\AmiteshSahay\\Desktop\\test_csv"
for name in os.listdir(folder):
    with open(os.path.join(folder, name), newline='') as f:
        reader = csv.reader(f)
        print(list(reader))
If I add the next() function after csv.reader(), I get the output below
Yes, this is what you should expect. Calling next on an iterator gives you the next value that comes out of that iterator. For a csv.reader built from an open file, that is the first row of the file. For example:
from io import StringIO
import csv
some_file = StringIO("""1
2
3""")
with some_file as fin:
    reader = csv.reader(fin)
    print(next(reader))
['1']
which happen to be sorted; for example, the names are 1.csv, 2.csv and so on.
This is not a coincidence, but it is not coming from the file contents either. Because csv.reader was given the filename string itself rather than an open file object, it iterates over the characters of that string, so next(reader) returns the first character of the filename parsed as a row (for 1.csv that is ['1']).
Apart from this, once I have iterated over the files, how do I see the contents of the CSV files on the screen?
Use the print command, as in the examples above.
Today I have 6 files. Later on, I could have 100 files. So it's not possible to use the file handling method in my scenario.
This is not true. You can define a function to print all or part of a csv file, then call that function in a for loop with the filename as input, as in the sketch below.
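For instance, a minimal sketch of that idea (print_csv is a hypothetical helper name; the folder path is the one from the question):

import csv
import os

def print_csv(path, max_rows=None):
    # print every row of one csv file, or only the first max_rows rows
    with open(path, newline='') as f:
        for i, row in enumerate(csv.reader(f)):
            if max_rows is not None and i >= max_rows:
                break
            print(row)

folder = "C:\\Users\\AmiteshSahay\\Desktop\\test_csv"
for name in sorted(os.listdir(folder)):
    if name.endswith(".csv"):
        print_csv(os.path.join(folder, name))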
If you want to import your files as separate dataframes, you can try this:
import pandas as pd
import os

filenames = os.listdir("../data/")  # lists all csv files in your directory

def extract_name_files(text):  # removes .csv from the name of each file
    name_file = os.path.splitext(text)[0].lower()
    return name_file

names_of_files = list(map(extract_name_files, filenames))  # creates a list that will be used to name your dataframes

for i in range(0, len(names_of_files)):  # saves each csv in a dataframe structure
    exec(names_of_files[i] + " = pd.read_csv('../data/' + filenames[i])")
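As a side note, a plain dictionary keyed by file name avoids exec and dynamically created variable names; a minimal sketch of that alternative (same ../data/ directory as above):

import os
import pandas as pd

data_dir = "../data/"
# map each lower-cased file name (without .csv) to its dataframe
dataframes = {
    os.path.splitext(name)[0].lower(): pd.read_csv(os.path.join(data_dir, name))
    for name in os.listdir(data_dir)
    if name.endswith(".csv")
}

Individual dataframes are then available as dataframes["somename"] instead of a separate variable per file.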
You can read and store several dataframes into separate variables using two lines of code.
import pandas as pd
datasets_list = ['users', 'calls', 'messages', 'internet', 'plans']
users, calls, messages, internet, plans = [(pd.read_csv(f'datasets/{dataset_name}.csv')) for dataset_name in datasets_list]

Reading a huge number of json files in Python?

This is not about reading large JSON files; instead, it's about reading a large number of JSON files in the most efficient way.
Question
I am working with last.fm dataset from the Million song dataset.
The data is available as a set of JSON-encoded text files where the keys are: track_id, artist, title, timestamp, similars and tags.
Currently I'm reading them into pandas in the following way; after going through a few options, this turned out to be the fastest, as shown here:
import os
import pandas as pd

try:
    import ujson as json
except ImportError:
    try:
        import simplejson as json
    except ImportError:
        import json

# Path to the dataset
path = "../lastfm_train/"

# Getting list of all json files in dataset
all_files = [os.path.join(root, file) for root, dirs, files in os.walk(path) for file in files if file.endswith('.json')]

data_list = [json.load(open(file)) for file in all_files]

df = pd.DataFrame(data_list, columns=['similars', 'track_id'])
df.set_index('track_id', inplace=True)
The current method reads the subset (1% of the full dataset) in less than a second. However, reading the full training set is too slow and takes forever (I have waited a couple of hours as well), and it has become a bottleneck for further tasks such as the one shown in the question here.
I'm also using ujson for speed when parsing the json files, as can be seen from this question here.
UPDATE 1
Using a generator expression instead of a list comprehension:
data_list=(json.load(open(file)) for file in all_files)
If you need to read and write the dataset multiple times, you could try converting .json files into a faster format. For example in pandas 0.20+ you could try using the .feather format.
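A minimal sketch of that round trip, assuming pyarrow is installed (the feather format requires it) and reusing the df built above; the file name is a placeholder:

import pandas as pd

# one-time conversion: parse the json files once, then cache the result as feather
# (feather cannot store a non-default index, hence the reset_index)
df.reset_index().to_feather("lastfm_train.feather")

# later runs: loading the cached copy is much faster than re-parsing the json files
df = pd.read_feather("lastfm_train.feather").set_index("track_id")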
I would build an iterator on files and just yield the two columns you want.
Then you can instantiate a DataFrame with that iterator.
import os
import json
import pandas as pd

# Path to the dataset
path = "../lastfm_train/"

def data_iterator(path):
    for root, dirs, files in os.walk(path):
        for f in files:
            if f.endswith('.json'):
                fp = os.path.join(root, f)
                with open(fp) as o:
                    data = json.load(o)
                yield {"similars": data["similars"], "track_id": data["track_id"]}

df = pd.DataFrame(data_iterator(path))
df.set_index('track_id', inplace=True)
This way you only go over your file list once, and you won't duplicate the data before and after passing it to the DataFrame.
