Joining multiple CSV Files - python

I have a folder with multiple CSV files. Every file consists of a date column and a value column. I would like to merge all the files into one where the first column consists of the date (which is the same for each file) and the other columns are populated by the values of each single file, i.e. (date, value_file1, value_file2, ...).
Any suggestions on how this could be achieved via a simple Python script, or maybe even through a Unix command?
Thank you for your help!

I'd recommend using a tool like csvkit's csvjoin:
pip install csvkit
$ csvjoin --help
usage: csvjoin [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u {0,1,2,3}] [-b]
[-p ESCAPECHAR] [-z MAXFIELDSIZE] [-e ENCODING] [-S] [-v] [-l]
[--zero] [-c COLUMNS] [--outer] [--left] [--right]
[FILE [FILE ...]]
Execute a SQL-like join to merge CSV files on a specified column or columns.
positional arguments:
FILE The CSV files to operate on. If only one is specified,
it will be copied to STDOUT.
optional arguments:
-h, --help show this help message and exit
-d DELIMITER, --delimiter DELIMITER
Delimiting character of the input CSV file.
-t, --tabs Specifies that the input CSV file is delimited with
tabs. Overrides "-d".
-q QUOTECHAR, --quotechar QUOTECHAR
Character used to quote strings in the input CSV file.
-u {0,1,2,3}, --quoting {0,1,2,3}
Quoting style used in the input CSV file. 0 = Quote
Minimal, 1 = Quote All, 2 = Quote Non-numeric, 3 =
Quote None.
-b, --doublequote Whether or not double quotes are doubled in the input
CSV file.
-p ESCAPECHAR, --escapechar ESCAPECHAR
Character used to escape the delimiter if --quoting 3
("Quote None") is specified and to escape the
QUOTECHAR if --doublequote is not specified.
-z MAXFIELDSIZE, --maxfieldsize MAXFIELDSIZE
Maximum length of a single field in the input CSV
file.
-e ENCODING, --encoding ENCODING
Specify the encoding the input CSV file.
-S, --skipinitialspace
Ignore whitespace immediately following the delimiter.
-v, --verbose Print detailed tracebacks when errors occur.
-l, --linenumbers Insert a column of line numbers at the front of the
output. Useful when piping to grep or as a simple
primary key.
--zero When interpreting or displaying column numbers, use
zero-based numbering instead of the default 1-based
numbering.
-c COLUMNS, --columns COLUMNS
The column name(s) on which to join. Should be either
one name (or index) or a comma-separated list with one
name (or index) for each file, in the same order that
the files were specified. May also be left
unspecified, in which case the two files will be
joined sequentially without performing any matching.
--outer Perform a full outer join, rather than the default
inner join.
--left Perform a left outer join, rather than the default
inner join. If more than two files are provided this
will be executed as a sequence of left outer joins,
starting at the left.
--right Perform a right outer join, rather than the default
inner join. If more than two files are provided this
will be executed as a sequence of right outer joins,
starting at the right.
Note that the join operation requires reading all files into memory. Don't try
this on very large files.
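For the scenario in the question, assuming each file has a header row with a column named date, the invocation would look something like this (file names are placeholders):

csvjoin -c date file1.csv file2.csv file3.csv > merged.csv

Since the question also asks for a simple Python script, here is a minimal sketch using pandas, under the same assumption that every file has date and value columns; adjust the names to match your files:

import glob
import pandas as pd

frames = []
for path in sorted(glob.glob("folder/*.csv")):
    df = pd.read_csv(path)
    # Use the file name as the column label so the output becomes
    # (date, value_file1, value_file2, ...).
    df = df.rename(columns={"value": path})
    frames.append(df.set_index("date"))

# Align all frames on the date index; an outer join keeps dates that
# are missing from some files.
pd.concat(frames, axis=1, join="outer").to_csv("merged.csv")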

Related

Ffmpeg invalid data when trying to merge videos

I have the function below for merging videos in Python without re-encoding, using FFmpeg:
import subprocess

def merge():
    """
    Merge a group of videos into one long video.
    This is used for merging parts that were cut from the original
    video into a new one. The videos are merged using ffmpeg; for
    ffmpeg to use the concat method on them, the video names must be
    listed in the vidlist.txt file.
    The new video is saved to a file named output.mp4.
    """
    command = "ffmpeg -hide_banner -loglevel error -f concat -safe 0 -i vidlist.txt -c copy output.mp4"
    # merge multiple parts of the video
    subprocess.call(command, shell=True)
I store paths to videos to be merged inside a vidlist.txt file which looks like:
file 'out11.mp4'
file 'out21.mp4'
But I am getting the following error:
vidlist.txt: Invalid data found when processing input
Here is the report file:
ffmpeg started on 2020-07-30 at 20:23:53
Report written to "ffmpeg-20200730-202353.log"
Log level: 48
Command line:
ffmpeg -hide_banner -f concat -safe 0 -i "C:\\Users\\miliv\\videocutter\\utils\\vidlist.txt" -c copy output.mp4 -report
Splitting the commandline.
Reading option '-hide_banner' ... matched as option 'hide_banner' (do not show program banner) with argument '1'.
Reading option '-f' ... matched as option 'f' (force format) with argument 'concat'.
Reading option '-safe' ... matched as AVOption 'safe' with argument '0'.
Reading option '-i' ... matched as input url with argument 'C:\Users\miliv\videocutter\utils\vidlist.txt'.
Reading option '-c' ... matched as option 'c' (codec name) with argument 'copy'.
Reading option 'output.mp4' ... matched as output url.
Reading option '-report' ... matched as option 'report' (generate a report) with argument '1'.
Finished splitting the commandline.
Parsing a group of options: global .
Applying option hide_banner (do not show program banner) with argument 1.
Applying option report (generate a report) with argument 1.
Successfully parsed a group of options.
Parsing a group of options: input url C:\Users\miliv\videocutter\utils\vidlist.txt.
Applying option f (force format) with argument concat.
Successfully parsed a group of options.
Opening an input file: C:\Users\miliv\videocutter\utils\vidlist.txt.
[concat @ 000001aaafb8e400] Opening 'C:\Users\miliv\videocutter\utils\vidlist.txt' for reading
[file @ 000001aaafb8f500] Setting default whitelist 'file,crypto,data'
[AVIOContext @ 000001aaafb97700] Statistics: 0 bytes read, 0 seeks
C:\Users\miliv\videocutter\utils\vidlist.txt: Invalid data found when processing input
The issue was that I was trying to merge the videos before closing the vidlist.txt file, which meant its contents had not been flushed to disk yet, so ffmpeg could not read it.
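Once that is fixed, the flow looks something like this minimal sketch, assuming the list is written by the same script (the file names are the ones from the question):

# Write the list inside a with-block so the file is flushed and
# closed before ffmpeg tries to read it.
with open("vidlist.txt", "w") as f:
    f.write("file 'out11.mp4'\n")
    f.write("file 'out21.mp4'\n")

merge()  # safe now: vidlist.txt is complete on disk

The "Statistics: 0 bytes read" line in the report is consistent with ffmpeg opening a file whose contents were still sitting in Python's write buffer.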

How can I concatenate multiple text or xml files but omit specific lines from each file?

I have a number of xml files (which can be treated as text files in this situation) that I wish to concatenate. Normally I think I could do something like this from a Linux command prompt or bash script:
cat somefile.xml someotherfile.xml adifferentfile.xml > out.txt
Except that in this case, I need to copy the first file in its entirety EXCEPT for the very last line, but in all subsequent files omit exactly the first four lines and the very last line (technically, I do need the last line from the last file but it is always the same, so I can easily add it with a separate statement).
In all these files the first four lines and the last line are always the same, but the contents in between vary. The names of the xml files can be hardcoded into the script or read from a separate data file, and their number may vary from time to time, but it will always be somewhere around 10-12.
I'm wondering what would be the easiest and most understandable way to do this. I think I would prefer either a bash script or maybe a python script, though I generally understand bash scripts a little better. What I can't get my head around is how to trim off just those first four lines (on all but the first file) and the last line of every file. My suspicion is there's some Linux command that can do this, but I have no idea what it would be. Any suggestions?
sed '$d' firstfile > out.txt
sed --separate '1,4d; $d' file2 file3 file4 >> out.txt
sed '1,4d' lastfile >> out.txt
It's important to use the --separate (or -s for short) option so that the range statements 1,4 and $ apply to each file individually.
From GNU sed manual:
-s, --separate
By default, sed will consider the files specified on the command line as a single continuous long stream. This GNU sed
extension allows the user to consider them as separate files.
Do it in two steps:
- Use the head command (to get the lines you want).
- Use cat to combine them.
You could use temp files or bash trickery.
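Since the question mentions Python as an option, here is a minimal sketch, relying on the stated guarantee that the first four lines and the last line are identical in every file (the file names are the ones from the question):

files = ["somefile.xml", "someotherfile.xml", "adifferentfile.xml"]

with open("out.txt", "w") as out:
    closing = None
    for i, name in enumerate(files):
        with open(name) as f:
            lines = f.readlines()
        closing = lines[-1]              # remember the shared last line
        start = 0 if i == 0 else 4       # keep the header only from the first file
        out.writelines(lines[start:-1])  # drop the last line of every file
    out.write(closing)                   # re-add the closing line once at the end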

Bash/Python to gather several paths to one and replacing the file name by '*' character

I'm writing a bash script to generate a .spec file (for building an RPM) automatically. I read all the files in the directory I want to package and write the paths of the files to install into the .spec file, and I realize that I need to shorten them. An example:
/tmp/a/1.jpg
/tmp/a/2.conf
/tmp/a/b/srf.cfg
/tmp/a/b/ssp.cfg
/tmp/a/conf_web_16.2/c/.htaccess
/tmp/a/conf_web_16.2/c/.htaccess.WebProv
/tmp/a/conf_web_16.2/c/.htprofiles
=> What I want to get:
/tmp/a/*.jpg
/tmp/a/*.conf
/tmp/a/b/*.cfg
/tmp/a/conf_web_16.2/c/*
/tmp/a/conf_web_16.2/c/*.WebProv
Please give me some advice about my problem. I would welcome suggestions in bash, Python, or C. Thank you in advance.
To convert any file name which contains a dot at a position other than the first character into a wildcard covering the part up to just before the dot, and any remaining file names to just a wildcard:
sed -e 's%/[^/][^/]*\(\.[^./]*\)$%/*\1%' -e t -e 's%/[^/]*$%/*%'
The behavior of sed is to read its input one line at a time, and execute the script of commands on each in turn. The s%foo%bar% substitution command replaces a regex match with a string, and the t command causes the script to skip further substitutions if one was already performed on the current line. (I'm simplifying somewhat.) The first regex matches file names which contain a dot in a position other than the first, and captures the match from the dot through the end in a back reference which is used in the substitution as well (that's the \1). The second is applied to any remaining file names, because of the t command in between.
The result will probably need to be piped to sort -u to remove any duplicates.
If you don't have a list of the file names, you can use find to pipe in a listing.
find . -type f | sed ... | sort -u
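For comparison, here is a rough Python equivalent of the sed approach (the paths are the ones from the example above):

import os

paths = [
    "/tmp/a/1.jpg",
    "/tmp/a/2.conf",
    "/tmp/a/b/srf.cfg",
    "/tmp/a/b/ssp.cfg",
    "/tmp/a/conf_web_16.2/c/.htaccess",
    "/tmp/a/conf_web_16.2/c/.htaccess.WebProv",
    "/tmp/a/conf_web_16.2/c/.htprofiles",
]

patterns = set()  # a set removes duplicates, like sort -u
for p in paths:
    d, name = os.path.split(p)
    dot = name.rfind(".")
    if dot > 0:  # a dot, but not as the first character (hidden files)
        patterns.add(os.path.join(d, "*" + name[dot:]))
    else:
        patterns.add(os.path.join(d, "*"))

for pat in sorted(patterns):
    print(pat)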

sys.argv arguments with spaces

I'm trying to pass folder names as sys.argv arguments, but am having problems with folder names that contain spaces, which get split into multiple arguments.
For example, with the command line below, "Folder Name" becomes two separate arguments.
Program.py D:\Users\Erick\Desktop\Folder Name
Any solutions?
Space is the delimiter for command-line arguments, so you'll be better off not using spaces in your directory and file names if possible. To pass an argument which has a space in it, you'll have to enclose it in quotes: "folder with space".
Program.py "D:\Users\Erick\Desktop\Folder Name"
Assuming input is always going to be a single file/folder path:
path = " ".join(sys.argv[1:])
To extend the simplicity of Arshiyan's answer to a case involving multiple paths, you could join the paths with a delimiter such as a hash, and then split the resulting string once it gets to Python...
paths = " ".join(sys.argv[1:]).split("#")

Copy the nth column of all the files in a directory into a single file

I have a directory containing many .csv files. How can I extract the nth column of every file into a new file, column-wise?
For example:
File A:
111,222,333
111,222,333
File B:
AAA,BBB,CCC
AAA,BBB,CCC
File C:
123,456,789
456,342,122
and so on...
If n = 2, I want my resultant file to be:
222,BBB,456,...
222,BBB,342,...
where ... represents that there will be as many columns as the number of files in the directory.
My try so far:
#!/bin/bash
for i in `find ./ -iname "*.csv"`
do
    awk -F, '{ print $2 }' < "$i" >> result.csv  ## This would append row-wise, not column-wise.
done
UPDATE:
I'm not trying to just join two files. There are about 100 files in a particular directory, and I want to copy the nth column of all the files into a single file. I gave two files as an example to show how I want the data to look if there were only two files.
As pointed out in the comments, joining two files is trivial, but joining multiple files may not be that easy, which is the whole point of my question. Would Python help to do this job?
Building on triplee's solution, here's a generic version which uses eval:
eval paste -d, $(printf "<(cut -d, -f2 %s) " *.csv)
I'm not too fond of eval (always be careful when using it), but it has its uses.
Hmm. My first thought is to have both an outer and inner loop. The outer loop would be a counter on line number. The inner loop would go through the csv files. You'd need to use head/tail in the inner loop to get the correct line number so you could grab the right field.
An alternative is to use the one loop you have now but write each line to a separate file and then merge them.
Neither of these seems ideal. Quite honestly, I'd do this in Perl so you could use an actual in-memory data structure and avoid the need for complex logic.
This one-liner should work (shown here for two files):
awk -F, -v OFS="," 'NR==FNR{a[NR]=$2;next}{print a[FNR],$2}' file1 file2
Assuming Bash process substitutions are acceptable (i.e. you don't require the solution to be portable to systems where Bash is not available);
paste -d, <(cut -d, -f2 file1) <(cut -d, -f2 file2) <(cut -d, -f2 file3) # etc
A POSIX solution requires temporary files instead.
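And since the update asks whether Python would help: here is a short sketch that gathers the nth column of every .csv file in the current directory (the column number, file pattern, and output name are assumptions):

import csv
import glob

n = 2  # 1-based column number to extract
columns = []
for path in sorted(glob.glob("*.csv")):
    with open(path, newline="") as f:
        columns.append([row[n - 1] for row in csv.reader(f)])

# zip(*columns) pairs the i-th value from every file into one output
# row; it stops at the shortest file, so equal row counts are assumed.
with open("result.csv", "w", newline="") as out:
    csv.writer(out).writerows(zip(*columns))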
