How to write the filename and row count in a CSV in Python?

I've been trying to build a CSV from a big list of other CSVs, and here's the deal: I want to get the names of these CSV files and put them in the CSV that I want to create, and I also need the row count of each of the CSV files I'm getting the names of. Here's what I've tried so far:
import os
import csv
import pandas as pd

def getRegisters(file):
    # note: error_bad_lines was replaced by on_bad_lines='skip' in newer pandas
    results = pd.read_csv(file, header=None, error_bad_lines=False, sep='\t', low_memory=False)
    print(len(results))
    return len(results)

path = "C:/Users/gdldieca/Desktop/TESTSFORPANW/New folder"
dirs = os.listdir(path)

with open("C:/Users/gdldieca/Desktop/TESTSFORPANW/New folder/FilesNames.csv", 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    writer.writerow(("File", "Rows"))
    for names in dirs:
        sfile = getRegisters("C:/Users/gdldieca/Desktop/TESTSFORPANW/New folder/" + str(names))
        writer.writerow((names, sfile))
However, I can't seem to get the files' row counts even though pandas actually returns them. I'm getting this error:
_csv.Error: iterable expected, not int
The final result would be something like this, written into the CSV:
File1 90
File2 10
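For context, csv.writer.writerow expects an iterable (a list or tuple representing one row); handing it a bare integer reproduces the exact error above. A minimal sketch:

import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, delimiter='\t')
writer.writerow(("File1", 90))  # fine: a tuple is an iterable row
writer.writerow(90)             # raises _csv.Error: iterable expected, not int

So the traceback suggests the row count is reaching writerow on its own somewhere, rather than inside a tuple as in the snippet above.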

If you are using pandas, I think you can also use it to build the CSV file with all the values you need. Here is an alternative:
import os
import pandas as pd

directory = 'D:\\MY\\PATH\\ALLCSVFILE\\'

# create a list to collect one entry per file
rows_list = []
for filename in os.listdir(directory):
    if filename.endswith(".csv"):
        file = os.path.join(directory, filename)
        df = pd.read_csv(file)
        # count rows
        rowcount = len(df.index)
        new_row = {'namefile': filename, 'count': rowcount}
        rows_list.append(new_row)

# pass the list to a dataframe
df1 = pd.DataFrame(rows_list)
print(df1)
df1.to_csv('test.csv', sep=',')
The result is a two-column CSV listing each filename (namefile) and its row count (count).
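If the files are large, loading each one fully with pandas just to count rows is relatively heavy. A lighter sketch, assuming one record per line (no embedded newlines inside quoted fields):

def count_rows(path):
    # stream the file and count lines without building a DataFrame
    with open(path, newline='') as f:
        return sum(1 for _ in f)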

Related

How to combine horizontally many CSV files using python csv or pandas module?

Hello!
I would like to combine many CSV files horizontally (the total number will be around 120-150) into one CSV file by taking one column from each file (in this case the column called "grid"). All the files have the same columns and the same number of rows (they are constructed the same way) and are stored in the same directory. I've tried the csv module and pandas. I don't want to list all 120 files by hand; I need a script to do it automatically. I'm stuck and out of ideas...
Here are some input CSV files (data) and the merged CSV file I would like to get:
https://www.dropbox.com/transfer/AAAAAHClI5b6TPzcmW2dmuUBaX9zoSKYD1ZrFV87cFQIn3PARD9oiXQ
This is how my code looks when I use the csv module:
import os
import glob
import csv

os.chdir('\csv_files_direction')
extension = 'csv'
files = [i for i in glob.glob('*.{}'.format(extension))]
out_merg = ('\merged_csv_file_direction')

with open(out_merg, 'wt') as out:
    writer = csv.writer(out)
    for file in files:
        with open(file) as csvfile:
            data = csv.reader(csvfile, delimiter=';')
            result = []
            for row in data:
                a = row[3]  # column which I need
                result.append(a)
Using this code I receive values only from the last CSV; the rest are missing. As a result I would like to have one specific column from each CSV file in the directory.
And with pandas:
import os
import glob
import pandas as pd
import csv

os.chdir('\csv_files_direction')
extension = 'csv'
files = [i for i in glob.glob('*.{}'.format(extension))]
out_merg = ('\merged_csv_file_direction')

in_names = [pd.read_csv(f, delimiter=';', usecols=['grid']) for f in files]
Using pandas I receive the data from all the CSVs as a list that can be navigated using e.g. in_names[1].
I confess this is my first try with pandas and I have no idea what my next step should be.
I will really appreciate any help!
Thanks in advance,
Mateusz
For the csv part, I think you need another list defined OUTSIDE the loop.
Something like:
import os
import glob
import csv

# resolve the folder the script is run from
dirname = os.path.dirname(os.path.realpath('__file__'))

extension = 'csv'
files = [i for i in glob.glob('*.{}'.format(extension))]
out_merg = ('merged_csv_file_direction')

result = []
with open(out_merg, 'wt', newline='') as out:
    writer = csv.writer(out)
    for file in files:
        with open(file) as csvfile:
            data = csv.reader(csvfile, delimiter=';')
            col = []
            for row in data:
                a = row[3]  # column which I need
                col.append(a)
            result.append(col)
    # write the collected columns side by side
    writer.writerows(zip(*result))
NOTE: I have also changed the way the script finds the folder. Now you can run the file directly from the folder that contains the two folders (one holding the input data and the other for saving the output).
Regarding the pandas part, you can loop again. This time you need to concat the dataframes that you created using in_names = [pd.read_csv(f, delimiter=';', usecols = ['grid']) for f in files].
I think you can use:
import os
import glob
import pandas as pd

os.chdir('\csv_files_direction')
extension = 'csv'
files = [i for i in glob.glob('*.{}'.format(extension))]
out_merg = ('\merged_csv_file_direction')

in_names = [pd.read_csv(f, delimiter=';', usecols=['grid']) for f in files]
# axis=1 concatenates the columns side by side (horizontally)
result = pd.concat(in_names, axis=1)
result.to_csv(out_merg, index=False)
Tell me if it works
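One caveat for the horizontal merge: every file contributes a column literally named grid, so concatenating 120+ of them side by side yields duplicate headers. A hedged sketch that renames each column after its source file (paths and delimiter assumed as above):

import glob
import pandas as pd

files = sorted(glob.glob('*.csv'))
# take the 'grid' Series from each file and rename it after the file
cols = [pd.read_csv(f, delimiter=';', usecols=['grid'])['grid'].rename(f)
        for f in files]
merged = pd.concat(cols, axis=1)  # axis=1 places the columns side by side
merged.to_csv('merged.csv', index=False)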

How to concatenate multiple csv files into one based on column names without having to type every column header in code

I am relatively new to Python (about a week's experience) and I can't seem to find the answer to my problem.
I am trying to merge hundreds of CSV files in my Data folder into a single CSV file, matching on column name.
The solutions I have found require me to type out either each file name or the column headers, which would take days.
I used this code to create one CSV file, but the column names move around, so the data does not stay in the same columns over the whole DataFrame:
import pandas as pd
import glob
import os

def concatenate(indir=r"C:\\Users\ge\Documents\d\de",
                outfile=r"C:\Users\ge\Documents\d"):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    dfList = []
    for filename in fileList:
        print(filename)
        df = pd.read_csv(filename, header=None)
        dfList.append(df)
    concatDf = pd.concat(dfList, axis=0)
    concatDf.to_csv(outfile, index=None)
Is there a quick method to do this, as I have less than a week to run statistics on the dataset?
Any help would be appreciated.
Here is one memory-efficient way to do that.
from pathlib import Path
import csv

indir = Path(r'C:\\Users\gerardchurch\Documents\Data\dev_en')
outfile = Path(r"C:\\Users\gerardchurch\Documents\Data\output.csv")

def find_header_from_all_files(indir):
    columns = set()
    print("Looking for column names in", indir)
    for f in indir.glob('*.csv'):
        with f.open() as sample_csv:
            sample_reader = csv.DictReader(sample_csv)
            try:
                first_row = next(sample_reader)
            except StopIteration:
                print("File {} doesn't contain any data. Double check this".format(f))
                continue
            else:
                columns.update(first_row.keys())
    return columns

columns = find_header_from_all_files(indir)
print("The columns are:", sorted(columns))

with outfile.open('w') as outf:
    wr = csv.DictWriter(outf, fieldnames=list(columns))
    wr.writeheader()
    for inpath in indir.glob('*.csv'):
        print("Parsing", inpath)
        with inpath.open() as infile:
            reader = csv.DictReader(infile)
            wr.writerows(reader)

print("Done, find the output at", outfile)
This should handle the case where one of the input CSVs doesn't contain all of the columns.
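The reason it copes with missing columns: csv.DictWriter fills absent keys with restval (an empty field by default). A small illustration:

import csv
import io

out = io.StringIO()
wr = csv.DictWriter(out, fieldnames=['a', 'b', 'c'], restval='')
wr.writeheader()
wr.writerow({'a': 1, 'c': 3})  # 'b' is missing and is written as an empty field
print(out.getvalue())          # a,b,c  then  1,,3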
I am not sure if I understand your problem correctly, but this is one way you can merge your files without giving any column names:
import pandas as pd
import glob
import os

def concatenate(indir):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    output_file = pd.concat([pd.read_csv(filename) for filename in fileList])
    output_file.to_csv("_output.csv", index=False)

concatenate(indir=r"C:\\Users\gerardchurch\Documents\Data\dev_en")
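For reference, pd.concat aligns frames by column name and fills any gaps with NaN, which is why reading each file with its header (instead of header=None) keeps the data in the right columns even when the column order varies between files. A quick sketch:

import pandas as pd

df1 = pd.DataFrame({'a': [1], 'b': [2]})
df2 = pd.DataFrame({'b': [3], 'a': [4], 'c': [5]})  # same names, different order, one extra
print(pd.concat([df1, df2], ignore_index=True))
#    a  b    c
# 0  1  2  NaN
# 1  4  3  5.0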

Concatenate csv files in python by ascending order of filenames

I need to concatenate CSV files with the same column headers in Python. The CSV files with the following filenames should be concatenated in order, as shown below (ascending order of filename):
AB201602.csv
AB201603.csv
AB201604.csv
AB201605.csv
AB201606.csv
AB201607.csv
AB201608.csv
AB201610.csv
AB201612.csv
I would like to keep the column headers only from the first file. Any idea?
I tried the code below, but it combined the CSV files in random filename order and truncated half of the column header names. Thanks.
csvfiles = glob.glob('/home/c/*.csv')
wf = csv.writer(open('/home/c/output.csv', 'wb'), delimiter=',')
for files in csvfiles:
    rd = csv.reader(open(files, 'r'), delimiter=',')
    rd.next()
    for row in rd:
        print(row)
        wf.writerow(row)
Using @Gokul's comment and pandas:
import pandas as pd
import glob

# sorted() guarantees ascending filename order; glob alone does not
csvfiles = sorted(glob.glob('/home/c/*.csv'))
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat((pd.read_csv(f) for f in csvfiles), ignore_index=True)
df.to_csv('newfile.csv', index=False)
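If you'd rather stay with the csv module you started from, a sketch under the same paths: sorted() gives the ascending filename order, and only the first file's header is copied through.

import csv
import glob

csvfiles = sorted(glob.glob('/home/c/*.csv'))  # ascending filename order
with open('/home/c/output.csv', 'w', newline='') as out:
    wf = csv.writer(out)
    for idx, name in enumerate(csvfiles):
        with open(name, newline='') as f:
            rd = csv.reader(f)
            header = next(rd)
            if idx == 0:
                wf.writerow(header)  # keep the header from the first file only
            wf.writerows(rd)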

Breaking up large file, but adding header to each subsequent file

I'm using the following code to break up a large CSV file, and I want the original CSV header to be written to each smaller CSV file. The problem I'm having, though, is that the current code seems to skip a line of data for each smaller file. So in the example below, line 51 wouldn't be written to the smaller file (code modified from http://code.activestate.com/recipes/578045-split-up-text-file-by-line-count/). It seems to skip that line, or perhaps it's being overwritten by the header:
import os

filepath = 'test.csv'
lines_per_file = 50
lpf = lines_per_file
path, filename = os.path.split(filepath)

with open(filepath, 'r') as r:
    name, ext = os.path.splitext(filename)
    try:
        w = open(os.path.join(path, '{}_{}{}'.format(name, 0, ext)), 'w')
        header = r.readline()
        for i, line in enumerate(r):
            if not i % lpf:
                # possible enhancement: don't check modulo lpf on each pass;
                # keep a counter variable and reset at each checkpoint lpf.
                w.close()
                filename = os.path.join(path,
                                        '{}_{}{}'.format(name, i, ext))
                w = open(filename, 'w')
                w.write(header)
            w.write(line)
    finally:
        w.close()
Consider using pandas to split the large CSV file.
First, let's create a CSV file with 500 rows and four columns using pandas:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(500, 4), columns=['a', 'b', 'c', 'd'])
df.to_csv('large_data.csv', index=False)
Now let's split large_data.csv into multiple CSV files of 50 rows each:
import pandas as pd

df = pd.read_csv('large_data.csv', chunksize=50)
i = 1
for chunk in df:
    chunk.to_csv('split_data_' + str(i) + '.csv', index=False)
    i = i + 1
This produces ten files, split_data_1.csv through split_data_10.csv, each holding 50 rows plus the original header.
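If you'd rather stick with the stdlib approach from the question, here is a sketch of the same splitter that writes the header to every part and never drops a data line (same filename pattern as the original):

import os

def split_csv(filepath='test.csv', lines_per_file=50):
    path, filename = os.path.split(filepath)
    name, ext = os.path.splitext(filename)
    w = None
    with open(filepath, 'r', newline='') as r:
        header = r.readline()
        for i, line in enumerate(r):
            if i % lines_per_file == 0:  # time to start a new part file
                if w:
                    w.close()
                w = open(os.path.join(path, '{}_{}{}'.format(name, i, ext)),
                         'w', newline='')
                w.write(header)
            w.write(line)  # every data line is written exactly once
    if w:
        w.close()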

Combining Multiple JSON Objects as one DataFrame in Python Pandas

I'm not sure what I'm missing here, but I have two zip files that contain JSON files, and I'm trying to combine the data I extract from them into one DataFrame, but my loop keeps giving me separate records. Here is what I have prior to constructing the DataFrame. I tried pd.concat, but I think my issue is more to do with the way I'm reading the files in the first place.
import glob
import json
import zipfile

data = []
for FileZips in glob.glob('*.zip'):
    with zipfile.ZipFile(FileZips, 'r') as myzip:
        for logfile in myzip.namelist():
            with myzip.open(logfile) as f:
                contents = f.readlines()[-2]
                jfile = json.loads(contents)
                print(len(jfile))
returns:
40935
40935
You can use read_json (assuming the content is valid JSON).
I would also break this up into functions for readability:
import glob
import zipfile
import pandas as pd

def zip_to_df(zip_file):
    with zipfile.ZipFile(zip_file, 'r') as myzip:
        return pd.concat((log_as_df(logfile, myzip)
                          for logfile in myzip.namelist()),
                         ignore_index=True)

def log_as_df(logfile, myzip):
    with myzip.open(logfile, 'r') as f:
        contents = f.readlines()[-2]
        return pd.read_json(contents)

df = pd.concat(map(zip_to_df, glob.glob('*.zip')), ignore_index=True)
Note: this does more concats, but I think it's worth it for readability. You could do just one concat...
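A single-concat variant of the same idea, reusing the log_as_df helper above and flattening the two loops into one generator:

import glob
import zipfile
import pandas as pd

def iter_logs():
    # yield one DataFrame per log file across all zips
    for zip_file in glob.glob('*.zip'):
        with zipfile.ZipFile(zip_file, 'r') as myzip:
            for logfile in myzip.namelist():
                yield log_as_df(logfile, myzip)

df = pd.concat(iter_logs(), ignore_index=True)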
I was able to get what I needed with a small adjustment to my indentation!
import glob
import json
import zipfile
import pandas as pd

dfs = []
for FileZips in glob.glob('*.zip'):
    with zipfile.ZipFile(FileZips, 'r') as myzip:
        for logfile in myzip.namelist():
            with myzip.open(logfile, 'r') as f:
                contents = f.readlines()[-2]
                jfile = json.loads(contents)
                dfs.append(pd.DataFrame(jfile))

df = pd.concat(dfs, ignore_index=True)
print(len(df))
