I have a lot of .txt files which together form a dataset that is too large to load into a variable (i.e. there is not enough memory to read all of the files into a pandas dataframe). Can I somehow get some descriptive statistics by just reading the files but not loading them into a dataframe/variable? How? Thank you!
To get this information, you can select the files with glob and open them as text files.
Assuming each is a CSV file with column titles on the first line, you can retrieve the keys by splitting the first line.
Then, based on How to get line count cheaply in Python?, count the remaining lines.
import glob

filenames = glob.glob('*.txt')
for filename in filenames:
    with open(filename) as f:
        # the first line holds the column names
        keys = f.readline().rstrip().split(',')
        # count the remaining (data) lines
        for i, l in enumerate(f):
            pass
    print("File:", filename, " keys:", keys, " len:", i + 1)
I have downloaded about 100 csv files from the web using python. Each file is for a month in a year, so effectively I am downloading time series data.
Now what I want is to put all of these csv files into one csv file in time order; I'm not sure how to append them one after another.
Also, I should note that, excluding the first time, I want to remove the headers every time I add a new csv file.
This will make sense when you see my data.
Appreciate any help, thanks.
Sort your CSV files by time (presumably this can be done with an alphanumeric sort of the filenames) and then just concatenate all of them together. This is probably easier to do in bash than in python but here's a python solution (untested):
from glob import glob

# Fetch a sorted list of all .csv files
files = sorted(glob('*.csv'))

# Open output file for writing
with open('cat.csv', 'w') as fi_out:
    # iterate over all csv files
    for i, fname_in in enumerate(files):
        # open each csv file
        with open(fname_in, 'r') as fi_in:
            # iterate through all lines in the csv file
            for i_line, line in enumerate(fi_in):
                # Write all lines of the first file (i == 0)
                # For all other files write all lines except the first one (i_line > 0)
                if i_line > 0 or i == 0:
                    fi_out.write(line)
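If the filenames don't happen to sort alphanumerically into chronological order, you can sort on a date parsed out of each name instead. This is only a sketch and assumes a hypothetical naming pattern like data_2019-03.csv; adjust the regular expression to match your actual filenames:

import re
from glob import glob

def month_key(fname):
    # assumes a hypothetical 'YYYY-MM' somewhere in the filename, e.g. data_2019-03.csv
    match = re.search(r'(\d{4})-(\d{2})', fname)
    return (int(match.group(1)), int(match.group(2)))

files = sorted(glob('*.csv'), key=month_key)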
In one of my directories, I have multiple CSV files. I want to read the content of all the CSV files through a python script and print the data, but so far I have not been able to do so.
All the CSV files have the same number of columns and the same column names as well.
I know a way to list all the CSV files in the directory and iterate over them using the "os" module and a "for" loop.
for files in os.listdir("C:\\Users\\AmiteshSahay\\Desktop\\test_csv"):

Then I use the "csv" module to read the file names:

    reader = csv.reader(files)
Up to this point I expect the output to be the names of the CSV files, which happen to be sorted; for example, the names are 1.csv, 2.csv and so on. But the output is as below:
<_csv.reader object at 0x0000019F97E0E730>
<_csv.reader object at 0x0000019F97E0E528>
<_csv.reader object at 0x0000019F97E0E730>
<_csv.reader object at 0x0000019F97E0E528>
<_csv.reader object at 0x0000019F97E0E730>
<_csv.reader object at 0x0000019F97E0E528>
If I add the next() function after csv.reader(), I get the output below:
['1']
['2']
['3']
['4']
['5']
['6']
This happens to be the first character of each of my CSV file names, which is partially correct but not fully.
Apart from this, once I have iterated over the files, how do I see the contents of the CSV files on the screen? Today I have 6 files. Later on, I could have 100 files. So, it's not possible to use the file handling method in my scenario.
Any suggestions?
The easiest way I found while developing my project is to use read_csv, glob, and a single dataframe.
import glob
import pandas as pd

folder_name = 'train_dataset'
file_type = 'csv'
separator = ','

dataframe = pd.concat(
    [pd.read_csv(f, sep=separator) for f in glob.glob(folder_name + "/*." + file_type)],
    ignore_index=True)
Here, all the csv files are loaded into 1 big dataframe.
I would recommend reading your CSVs using the pandas library.
Check this answer here: Import multiple csv files into pandas and concatenate into one DataFrame
Although you asked for python in general, pandas does a great job at data I/O and would help you here in my opinion.
Up to this point I expect the output to be the names of the CSV files
This is the problem. csv.reader objects do not represent filenames; they are lazy objects which may be iterated to yield rows from a CSV file, and they must be built from an open file object (or another iterable of lines), not from a filename string. So open each file first, then wrap it in a reader. If you wish to print an entire CSV file, you can call list on the csv.reader object:
folder = "C:\\Users\\AmiteshSahay\\Desktop\\test_csv"
for name in os.listdir(folder):
    # open the file first; csv.reader needs a file object, not a filename
    with open(os.path.join(folder, name), newline='') as f:
        print(list(csv.reader(f)))
If I add the next() function after csv.reader(), I get the output below
Yes, this is what you should expect. Calling next on an iterator gives you the next value that comes out of that iterator. With your current code that value is built from the filename string (more on that below); with a reader wrapped around an open file it would be the first row of the file. For example:
from io import StringIO
import csv

some_file = StringIO("""1
2
3""")

with some_file as fin:
    reader = csv.reader(fin)
    print(next(reader))
['1']
which happen to be sorted; for example, the names are 1.csv, 2.csv and so on.
This is not a coincidence. Because you passed the filename string itself to csv.reader, the reader iterates over that string character by character and treats each character as a line, so next(reader) returns the first character of the filename: ['1'] for 1.csv, ['2'] for 2.csv, and so on. With a reader built from an open file, next(reader) would return the first row of the file's contents instead.
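You can see this directly by feeding a plain string to csv.reader (a minimal illustration of the pitfall, not something you would do on purpose):

import csv

reader = csv.reader("1.csv")   # a string, not an open file
print(next(reader))            # ['1'] -- the first character of the string
print(next(reader))            # ['.']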
Apart from this, once I have iterated over the files, how do I see the contents of the CSV files on the screen?
Use the print command, as in the examples above.
Today I have 6 files. Later on, I could have 100 files. So, it's not
possible to use the file handling method in my scenario.
This is not true. You can define a function to print all or part of a csv file, then call that function in a for loop with each filename as input, as in the sketch below.
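For instance, a minimal sketch (the folder path is the one from the question; the n_rows limit is just a placeholder):

import csv
import os

def print_head(path, n_rows=5):
    # print the first n_rows rows of a CSV file
    with open(path, newline='') as f:
        for i, row in enumerate(csv.reader(f)):
            if i >= n_rows:
                break
            print(row)

folder = "C:\\Users\\AmiteshSahay\\Desktop\\test_csv"
for name in os.listdir(folder):
    print_head(os.path.join(folder, name))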
If you want to import your files as separate dataframes, you can try this:
import pandas as pd
import os

filenames = os.listdir("../data/")  # lists all files in your data directory

def extract_name_files(text):  # removes .csv from the name of each file
    # note: text.strip('.csv') would strip those characters, not the suffix
    name_file = os.path.splitext(text)[0].lower()
    return name_file

names_of_files = list(map(extract_name_files, filenames))  # list used to name your dataframes

for i in range(0, len(names_of_files)):  # saves each csv in a dataframe structure
    exec(names_of_files[i] + " = pd.read_csv('../data/' + filenames[i])")
You can read and store several dataframes into separate variables using two lines of code.
import pandas as pd
datasets_list = ['users', 'calls', 'messages', 'internet', 'plans']
users, calls, messages, internet, plans = [(pd.read_csv(f'datasets/{dataset_name}.csv')) for dataset_name in datasets_list]
I would like to be able to read data from multiple files in one folder to multiple arrays and then perform analysis on these arrays such as plot graphs etc. I am currently having trouble reading the data from these files into multiple arrays.
My solution process so far is as follows;
import numpy as np
import os

# Create an empty list to read filenames to
filenames = []

for file in os.listdir('C\\folderwherefileslive'):
    filenames.append(file)
This works so far; what I'd like to do next is to iterate over the filenames in the list using numpy.genfromtxt.
I'm trying to use os.path.join to put each individual list entry at the end of the path specified in listdir earlier. This is some example code:
for i in filenames:
    file_name = os.path.join('C:\\entryfromabove','i')
    'data_'+[i] = np.genfromtxt('file_name',skiprows=2,delimiter=',')
This piece of code returns "Invalid syntax".
To sum up the solution process I'm trying to use so far:
1. Use os.listdir to get all the filenames in the folder I'm looking at.
2. Use os.path.join to direct np.genfromtxt to open and read data from each file to a numpy array named after that file.
I'm not experienced with python by any means - any tips or questions on what I'm trying to achieve are welcome.
For this kind of task you'd want to use a dictionary.
data = {}
folder = 'C:\\folderwherefileslive'

for file in os.listdir(folder):
    path = os.path.join(folder, file)
    data[file] = np.genfromtxt(path, skiprows=2, delimiter=',')

# now you could for example access
data['foo.txt']
Notice that anything you put within single or double quotes is a literal character string, so 'file_name' is just those characters, whereas file_name (without quotes) uses the value stored in the variable of that name.
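Once the arrays are in the dictionary, you can loop over it for whatever analysis you need. A small sketch, assuming each file parses to a 2-D array and that matplotlib is available; the statistics and plot shown are just placeholders for your own analysis:

import numpy as np
import matplotlib.pyplot as plt

for name, arr in data.items():
    print(name, 'shape:', arr.shape, 'mean of first column:', np.mean(arr[:, 0]))
    plt.plot(arr[:, 0], label=name)  # plot the first column of each file

plt.legend()
plt.show()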
I am using
for file in fileList:
    f.write(open(file).read())
I am combining the files in a folder into one csv. However, I don't need X amount of headers in the one file.
Is there a way to use this and have it write everything except the first row (the header) coming from each of the files?
Use the python csv module, or do something like this:
for file_name in file_list:
    with open(file_name) as file_obj:
        file_obj.readline()      # skip the header line
        for line in file_obj:    # stream the remaining lines
            f.write(line)

This solution doesn't load a whole file into memory, whereas using file_obj.readlines() (as in the snippet below) loads the entire file content into memory at once.
Note that it isn't good practice to name variables after built-in names (such as file):
for file in fileList:
    mylines = open(file).readlines()
    f.write("".join(mylines[1:]))
This should point you in the right direction. Please don't do your homework on stackoverflow.
If it's a csv file, look into the python csv lib; a sketch of that approach follows below.
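For reference, a minimal sketch of the csv-module approach suggested above (the '*.csv' pattern and the combined.csv output name are placeholders); it keeps the header from the first file only and skips it in all the others:

import csv
from glob import glob

files = sorted(glob('*.csv'))

with open('combined.csv', 'w', newline='') as out_f:
    writer = csv.writer(out_f)
    for i, name in enumerate(files):
        with open(name, newline='') as in_f:
            reader = csv.reader(in_f)
            header = next(reader)        # read the header row
            if i == 0:
                writer.writerow(header)  # keep it only for the first file
            writer.writerows(reader)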
I have a directory with 260+ text files containing scoring information. I want to create a summary text file of all of these files containing the filename and the first two lines of each file. My idea was to create two lists separately and 'zip' them. However, I can get the list of the filenames but I can't get the first two lines of each file into an appended list. Here is my code so far:
# creating a list of filename
for f in os.listdir("../scores"):
    (pdb, extension) = os.path.splitext(f)
    name.append(pdb[1:5])

# creating a list of the first two lines of each file
for f in os.listdir("../scores"):
    for line in open(f):
        score.append(line)
        b = f.nextline()
        score.append(b)
I get an error that str has no attribute nextline. Please help, thanks in advance.
The error comes from calling nextline on f, which at that point is just the filename string from os.listdir, not a file object; and in any case file objects use readline(), not nextline(). Mixing a manual read with the for line in f iterator also makes it awkward to take just the first two lines. Here's a quick fix (one of several ways to do it, I'm sure):
# creating a list of the first two lines of each file
for f in os.listdir("../scores"):
    with open(os.path.join("../scores", f)) as fh:
        score.append(fh.readline())
        score.append(fh.readline())
The with statement takes care of closing the file for you after you're done, and it gives you a filehandle object (fh), which you can grab lines from manually.
File objects have a next() method not nextline().
Merging comment from David and answer from perimosocordiae:
from __future__ import with_statement
from itertools import islice
import os

NUM_LINES = 2

with open('../dir_summary.txt', 'w') as dir_summary:
    for f in os.listdir('.'):
        with open(f) as tf:
            dir_summary.write('%s: %s\n' % (f, repr(', '.join(islice(tf, NUM_LINES)))))
Here is my perhaps more old-fashioned version, with redirected printing for easier newlines.
## written for Python 2.7, summarize filename and first two lines of files of a given filetype
import os

extension = '.txt'  ## replace with extension of desired files
os.chdir('.')  ## '../scores') ## location of files

summary = open('summary.txt', 'w')

# creating a list of filenames with the right extension
for fn in [fn for fn in os.listdir(os.curdir) if os.path.isfile(fn) and fn.endswith(extension)]:
    with open(fn) as the_file:
        print >>summary, '**' + fn + '**'
        print >>summary, the_file.readline(), the_file.readline(),
        print >>summary, '-' * 60

summary.close()

## show result
print(open('summary.txt').read())
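Since the script above is Python 2 only, here is a rough Python 3 equivalent of the same idea (same assumptions: summarise the first two lines of every .txt file in the current directory, writing to summary.txt); treat it as a sketch rather than a drop-in replacement:

import os

extension = '.txt'  # replace with the extension of the desired files
out_name = 'summary.txt'

with open(out_name, 'w') as summary:
    for fn in sorted(os.listdir(os.curdir)):
        # skip non-files, files with the wrong extension, and the summary itself
        if fn == out_name or not (os.path.isfile(fn) and fn.endswith(extension)):
            continue
        with open(fn) as the_file:
            summary.write('**' + fn + '**\n')
            summary.write(the_file.readline())
            summary.write(the_file.readline())
            summary.write('-' * 60 + '\n')

# show result
print(open(out_name).read())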