I'm very new to scripting and as a result am not sure how best to merge a series of files. I'm attempting to create a Quality Control script that makes sure a nightly load was properly uploaded to the DB (we've noticed that if there's a lag for some reason, the sync will exclude any donations that came in during said lag).
I have a directory of daily synced files labeled as such:
20161031_donations.txt
20161030_donations.txt
20161029_donations.txt
20161028_donations.txt
etc etc
Every file has the same header.
I'd like to merge the last 7 days of files into one file with just 1 header row. I'm mostly struggling with understanding how to wildcard a date range. I've only ever done:
for i in a.txt b.txt c.txt d.txt
do this
done
which is fine for a static merge but not dynamic enough to integrate into a proper QC script.
I have a unix background but would like to do this in python. I'm new to python so please be explanatory in any suggestions.
Expanding on Alex Hall's answer, you can grab the header from the first file and skip it for the remaining files to do the merge:
from glob import glob
from shutil import copyfileobj

files = sorted(glob('*_donations.txt'))[-7:]
# if you want the most recent file first, do
# files.reverse()

with open("merged_file.txt", "w") as outfile:
    for i, filename in enumerate(files):
        with open(filename) as infile:
            if i:
                next(infile)  # discard the header on all but the first file
            copyfileobj(infile, outfile)  # write the remaining lines
The advantage of your date format (assuming it has zero padding, e.g. 20160203 for 3rd Feb) is that it can be sorted alphabetically! So you can just do this:
from glob import glob
for path in sorted(glob('*_donations.txt'))[-7:]:
    with open(path) as f:
        ...  # get the content for merging
This will get the 7 most recent files, starting with the oldest. This is why ISO 8601 is the best date format.
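If the QC script needs "the last 7 calendar days" rather than "the 7 most recent files", a minimal sketch under the same YYYYMMDD_donations.txt naming assumption is to build the expected filenames from today's date and keep the ones that exist:
from datetime import date, timedelta
import os

today = date.today()
expected = ['{}_donations.txt'.format((today - timedelta(days=n)).strftime('%Y%m%d'))
            for n in range(7)]
files = [name for name in expected if os.path.exists(name)]  # only the days that actually synced
Any name in expected that is missing on disk marks a day the nightly sync skipped, which is exactly what the QC check needs to flag.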
Related
I'm working with JSON files and I've created some code that will open a single file and load it into a pandas dataframe, performing some procedures on the data within. A snippet of this code follows:
response_dic=first_response.json()
print(response_dic)
base_df=pd.DataFrame(response_dic)
base_df.head()
The code then goes on to extract parts of the JSON data into dataframes, before merging and printing to CSV.
Where I want to develop the code is to have it iterate through a folder first, find filenames that match my list of filenames that I want to work on, and then perform the functions on those files. For example, I have a folder with 1000 docs and I will only need to perform the function on a sample of these.
I've created a CSV list of the account codes that I want to work on; I've then imported the CSV and built a list of account codes as follows:
import csv

csv_file = open(r'C:\filepath', 'r')
cikas = []
cikbs = []
csv_file.readline()  # skip the header row
for a, b, c in csv.reader(csv_file, delimiter=','):
    cikas.append(a)
    cikbs.append(b)
midstring = [s for s in cikbs]
print(midstring)
My account names are then stored in midstring, for example ['12345', '2468', '56789']. This means I can control which account codes are worked on by amending my CSV file in the future. These names will vary at different stages, hence I don't want to hard-code them at this stage.
What I would like the code to do is check the working directory and see if there is a file that matches, for example, C:\Users*12345.json. If there is, perform the pandas procedures on it, then move to the next file. Is this possible? I've tried a number of tutorials involving glob, iglob, fnmatch etc. but I'm struggling to come up with a workable solution.
You can list all the files with the .json extension in the current directory first.
import os, json
import pandas as pd
path_to_json = 'currentdir/'
json_files = [json_file for json_file in os.listdir(path_to_json) if json_file.endswith('.json')]
print(json_files)
Now iterate over the list of json_files and perform a check:
# example list: json_files = ['12345.json', '2468.json', '56789.json']
# midstring = ['12345', '2468', '56789']
for file in json_files:
    if file.split('.')[0] in midstring:
        with open(os.path.join(path_to_json, file)) as f:
            data = json.load(f)      # load the JSON content
        df = pd.DataFrame(data)      # build a dataframe from it
        # perform pandas functions
    else:
        continue
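If you would rather match on a wildcard pattern, as in the C:\Users*12345.json example from the question, here is a small sketch with glob; the '*{code}.json' pattern is an assumption about where the account code sits in the filename, and pd.read_json expects a tabular JSON layout (otherwise fall back to json.load plus pd.DataFrame as above):
import glob, os

for code in midstring:
    # any .json file whose name ends with the account code, e.g. 'report_12345.json'
    for path in glob.glob(os.path.join(path_to_json, '*{}.json'.format(code))):
        df = pd.read_json(path)
        # perform pandas functions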
I am looking to pull a CSV file that is downloaded to my downloads folder into a pandas dataframe. Each time it is downloaded, a number is added to the end of the filename, because the file is already in the folder. For example, if 'transactions (44).csv' is in the folder, the next time this file is downloaded it is named 'transactions (45).csv'.
I've looked into the glob library and using the os library to open the most recent file in my downloads folder, but I was unable to produce a solution. I'm thinking I need some way to connect to the downloads path, find all CSV file types, those with the string 'transactions' in the name, and grab the one with the max number in the full filename string.
list(csv.reader(open(path + '/transactions (45).csv')))
I'm hoping for something like path + '/%transactions%' + 'max()' + '.csv'. I know the final answer will be completely different, but I hope this makes sense.
Assuming the filename format "transactions (number).csv", try the following:
import os
import numpy as np
import pandas as pd
files=os.listdir('Downloads/')
tranfiles=[f for f in files if 'transactions' in f]
Now, your target file is as below:
target_file=tranfiles[np.argmax([int(t.split('(')[1].split(')')[0]) for t in tranfiles])]
Read that desired file as below:
df=pd.read_csv('Downloads/'+target_file)
One option is to use regular expressions to extract the numerically largest file ID and then construct a new file name:
import re
import glob
last_id = max(int(re.findall(r" \(([0-9]+)\).csv", x)[0])
              for x in glob.glob("transactions*.csv"))
name = f'transactions ({last_id}).csv'
Alternatively, you can find the newest file directly by its modification time:
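A minimal sketch of that approach, assuming the files sit in the current working directory and at least one matches the pattern:
import glob
import os

# the transactions file that was written to disk most recently
newest = max(glob.glob("transactions*.csv"), key=os.path.getmtime)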
Note that you should not use Python's csv reader to load CSV files into pandas; use pd.read_csv() instead.
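For example, with the name built above (a one-line sketch; prepend your downloads path if the file does not live in the working directory):
import pandas as pd

df = pd.read_csv(name)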
I have downloaded about 100 CSV files from the web using Python. Each file is for a month in a year, so effectively I am downloading time series data.
Now what I want is to put all of these CSV files into one CSV file in time order; I'm not sure how to append them one after another.
Also, I should note that except for the first file, I want to remove the header every time I add a new CSV file.
This will make sense when you see my data.
Appreciate any help, thanks
Sort your CSV files by time (presumably this can be done with an alphanumeric sort of the filenames) and then just concatenate all of them together. This is probably easier to do in bash than in python but here's a python solution (untested):
from glob import glob

# Fetch a sorted list of all .csv files
files = sorted(glob('*.csv'))

# Open the output file for writing
with open('cat.csv', 'w') as fi_out:
    # iterate over all csv files
    for i, fname_in in enumerate(files):
        # open each csv file
        with open(fname_in, 'r') as fi_in:
            # iterate through all lines in the current csv file
            for i_line, line in enumerate(fi_in):
                # Write all lines of the first file (i == 0)
                # For all other files write all lines except the first one (i_line > 0)
                if i_line > 0 or i == 0:
                    fi_out.write(line)
I want to open multiple csv files in python, collate them and have python create a new file with the data from the multiple files reorganised...
Is there a way for me to find all the files in a single directory on my desktop and read them in Python like this?
Thanks a lot
If you have a directory containing your csv files, and they all have the extension .csv, then you could use, for example, glob and pandas to read them all in and concatenate them into one csv file. For example, say you have a directory, like this:
csvfiles/one.csv
csvfiles/two.csv
where one.csv contains:
name,age
Keith,23
Jane,25
and two.csv contains:
name,age
Kylie,35
Jake,42
Then you could do the following in Python (you will need to install pandas with, e.g., pip install pandas):
import glob
import os
import pandas as pd
# the path to your csv file directory
mycsvdir = 'csvfiles'
# get all the csv files in that directory (assuming they have the extension .csv)
csvfiles = glob.glob(os.path.join(mycsvdir, '*.csv'))
# loop through the files and read them in with pandas
dataframes = [] # a list to hold all the individual pandas DataFrames
for csvfile in csvfiles:
    df = pd.read_csv(csvfile)
    dataframes.append(df)
# concatenate them all together
result = pd.concat(dataframes, ignore_index=True)
# print out to a new csv file
result.to_csv('all.csv')
Note that the output csv file will have an additional column at the front containing the index of the row. To avoid this you could instead use:
result.to_csv('all.csv', index=False)
You can see the documentation for the to_csv() method here.
Hope that helps.
Here is a very simple way to do what you want to do.
import pandas as pd
import glob, os
os.chdir("C:\\your_path\\")
results = pd.DataFrame([])
for counter, file in enumerate(glob.glob("1*")):
    namedf = pd.read_csv(file, skiprows=0, usecols=[1, 2, 3])
    results = results.append(namedf)  # on pandas >= 2.0, use results = pd.concat([results, namedf]) instead
results.to_csv('C:\\your_path\\combinedfile.csv')
Notice this part: glob("1*")
This will look only for files that start with '1' in the name (1, 10, 100, etc). If you want everything, change it to this: glob("*")
Sometimes it's necessary to merge all CSV files into a single CSV file, and sometimes you just want to merge some files that match a certain naming convention. It's nice to have this feature!
I know that the post is a little bit old, but using glob can be quite expensive in terms of memory if you are trying to read large CSV files, because you store all that data in a list and then still have to have enough memory to concatenate the dataframes in that list into one dataframe with all the data. Sometimes this is not possible.
import pandas as pd

dir = 'directory path'
df = pd.DataFrame()
for i in range(0, 24):
    # .format(i=i) is needed to actually substitute the number into '{i}'
    csvfile = pd.read_csv(dir + '/file name{i}.csv'.format(i=i), encoding='utf8')
    df = df.append(csvfile)  # on pandas >= 2.0, use df = pd.concat([df, csvfile]) instead
    del csvfile
So, in case your CSV files have the same name plus some kind of number or string that differentiates them, you can just loop through the files and delete each one after it has been appended to a dataframe variable. In this case all my CSV files have the same name except that they are numbered in a range that goes from 0 to 23.
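If even the combined dataframe would be too big for memory, a rougher sketch under the same numbered-filename assumption is to append each file straight to one output CSV on disk (named combined.csv here purely for illustration), writing the header only once:
import pandas as pd

dir = 'directory path'
for i in range(0, 24):
    chunk = pd.read_csv(dir + '/file name{i}.csv'.format(i=i), encoding='utf8')
    # first file: create the output and write the header; later files: append without it
    chunk.to_csv(dir + '/combined.csv', mode='w' if i == 0 else 'a',
                 header=(i == 0), index=False)
    del chunk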
I have a lot of .txt files, which together form a dataset that is too large to load into a variable (i.e. there's not enough memory to load all the files into a pandas dataframe). Can I somehow get some descriptive statistics by just reading the files, without loading them into a dataframe/variable? How? Thank you!
In order to get information, you can select the files with glob and open them as text files.
Assuming each is a CSV file with column titles on the first line, you can retrieve the keys by splitting that first line.
Based on How to get line count cheaply in Python?, count the remaining lines.
import glob

filenames = glob.glob('*.txt')
for filename in filenames:
    with open(filename) as f:
        keys = f.readline().rstrip().split(',')
        i = -1  # in case a file contains only the header line
        for i, l in enumerate(f):
            pass
        print("File:", filename, " keys:", keys, " len:", i + 1)