I've run into a problem while doing my assignment in Python. I'm new to Python, so I'm a complete beginner.
Question: How can I merge two files below?
s555555,7
s333333,10
s666666,9
s111111,10
s999999,9
and
s111111,,,,,
s222222,,,,,
s333333,,,,,
s444444,,,,,
s555555,,,,,
s666666,,,,,
s777777,,,,,
After merging, it should look something like:
s111111,10,,,,
s222222,,,,,
s333333,10,,,,
s444444,,,,,
s555555,7,,,,
s666666,9,,,,
s777777,,,,,
s999999,9,,,,
Thanks for reading, and any help would be appreciated!
Here are the steps you can follow for one approach to the problem. I'll be using FileA, FileB and Result as the filenames.
One way to approach the problem is to give each position in a line (each field between the ,s) a number to reference it by. Then, when you read a line from FileA, you know that after the first , you need to put the matching data from FileB to build the result you will write out to Result.
Open FileA. Ideally you should use the with statement, because it will automatically close the file when it's done. Or you can use a plain open() call, but make sure you close the file after you are done.
Loop through each line of FileA and add it to a list. (Hint: you should use split().) Why a list? It makes it easy to refer to items by index, and that's our plan.
Repeat steps 1 and 2 for FileB, but store it in a different list variable.
Now the next part is to loop through the list of lines from FileA, match them against the list from FileB, and create the new lines that you will write to the Result file. You can do this many ways, but a simple way is:
First create an empty list that will store your results (final_lines = [])
Loop through the list that holds the lines from FileA.
You should also keep in mind that not every line from FileA will have a corresponding line in FileB. For every first "bit" in FileA's list, find the corresponding line in FileB's list, then get the next item by using index(). If you are keen, you will have realized that the first item is always 0 and the next one is always 1, so why not simply hard-code the values? Because if you look at the assignment, there are multiple ,s, so at some point there could be a fourth or fifth "column" that needs to be added. Teachers love to check for this stuff.
Use append() to add the items in the right order to final_lines.
Now that you have the list of lines ready, the last part is simple:
Open a new file (use with or open)
Loop through final_lines
Write each line out to the file (make sure you don't forget the end of line character).
Close the file.
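Putting the steps together, here is a minimal sketch of how it could look. The filenames FileA, FileB and Result match the description above; I use a dict for the matching step instead of index() because it's simpler, and the six-column row shape is an assumption based on the sample data:
# Steps 1-2: read FileA into a dict of key -> extra fields
with open("FileA") as fa:
    scores = {}
    for line in fa:
        fields = line.strip().split(",")
        scores[fields[0]] = fields[1:]          # e.g. {"s555555": ["7"]}

# Step 3: same for FileB, which supplies the empty "template" rows
with open("FileB") as fb:
    template = {}
    for line in fb:
        fields = line.strip().split(",")
        template[fields[0]] = fields            # e.g. ["s111111", "", "", "", "", ""]

# Step 4: build the merged lines
final_lines = []
for key in sorted(set(scores) | set(template)):
    # start from FileB's row if it exists, otherwise make an all-empty one
    row = template.get(key, [key, "", "", "", "", ""])
    if key in scores:
        # fill the column(s) after the key with the data from FileA
        row[1:1 + len(scores[key])] = scores[key]
    final_lines.append(",".join(row))

# Last part: write the result out, one line at a time
with open("Result", "w") as out:
    for line in final_lines:
        out.write(line + "\n")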
If you have any specific questions, please ask.
Not Python-related, but on Linux:
sort -k1 c1.csv > sorted1
sort -k1 c2.csv > sorted2
join -t , -11 -21 -a 1 -a 2 sorted1 sorted2
Result:
s111111,10,,,,,
s222222,,,,,
s333333,10,,,,,
s444444,,,,,
s555555,7,,,,,
s666666,9,,,,,
s777777,,,,,
s999999,9
Make a dict using the first element as a primary key, and then merge the rows?
Something like this:
import csv

f1 = csv.reader(open('file1.csv', 'rb'))
f2 = csv.reader(open('file2.csv', 'rb'))

mydict = {}
for row in f1:
    mydict[row[0]] = row[1:]
for row in f2:
    # extend() mutates the list in place and returns None,
    # so don't assign its result back into the dict
    mydict.setdefault(row[0], []).extend(row[1:])

fout = csv.writer(open('out.txt', 'wb'))  # csv.writer, not csv.write
for k, v in mydict.items():               # iterate over items, not the dict itself
    fout.writerow([k] + v)
If I have a program that asks for a user's name and score, opens a .txt file, searches the file for their name like:
for x in f.readlines():
    if name in x.strip():
        # etc
If the name is found it has to edit that line and add the new score, but it must only store x number of scores, say the last 4 scores. So if 4 scores are already stored, it must delete the oldest one so as to only keep the latest 4 scores.
If the name isn't found then it's just a simple append to end of file.
How would I accomplish this?
I would read the file and parse it into some kind of data structure (e.g. a dictionary), then update the values in the dict and afterwards write it back. It may not be the most efficient way, but it's the way I've used most.
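A minimal sketch of that idea, assuming one name,score1,score2,... record per line; the filename scores.txt is made up, and name / new_score stand for the values your program asked the user for:
# parse the file into a dict: name -> list of scores
scores = {}
with open("scores.txt") as f:
    for line in f:
        fields = line.strip().split(",")
        scores[fields[0]] = fields[1:]

# update (or create) the entry, then keep only the latest 4 scores
scores.setdefault(name, []).append(str(new_score))
scores[name] = scores[name][-4:]

# write the whole thing back
with open("scores.txt", "w") as f:
    for n, vals in scores.items():
        f.write(n + "," + ",".join(vals) + "\n")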
First, you need to figure out the format of the line. How will you separate names and scores? I'm using a comma.
for line in f:  # readlines() is not necessary here
    if name in line:
        name, s = line.split(',', 1)
        scores = s.strip().split(',')  # strip() drops the newline so it doesn't stick to the last score
        if len(scores) >= 4:           # >= so that after appending we still have only 4
            del scores[0]              # remove the oldest one
        scores.append(the_new_score)
        new_line = "%s,%s" % (name, ','.join(scores))
And do something with new_line.
(note: this is just a very quick idea for how to accomplish this. This does not update the file in-place nor write out the results anywhere. That is left as an exercise to the reader, unless you actually need help with that part too)
I'm trying to strip subdomains off of a large list of domains in a text file. The script works, but only for the last domain in the list. I know the problem is in the loop but can't pinpoint the exact issue. Thanks for any assistance :)
with open("domainlist.txt", "r") as datafile:
    s = datafile.read()
    for x in s:
        t = '.'.join(s.split('.')[-2:])
print t
This will take "example.test.com" and return "test.com". The only problem is it won't perform this for every domain in the list, only the last one.
What you want is to build up a new list by modifying the elements of the old one. Fortunately, Python has the list comprehension, which is perfect for this job.
with open("domainlist.txt", "r") as datafile:
modified = ['.'.join(x.split('.')[-2:]) for x in datafile]
This behaves exactly like creating a list and adding items to it in a for loop, except faster and nicer to read. I recommend watching the video linked above for more information on how to use them.
Note that file.read() reads the entire thing in as one big string; what you wanted was probably to loop over the lines of the file, which is done just by looping over the file object itself. Your current loop loops over the individual characters of the file, rather than the lines.
You are overwriting t in each loop iteration, so naturally only the value from the last iteration stays in t. Instead, collect the strings in a list with list.append.
Try this out. Better readability.
with open("domainlist.txt", "r") as datafile:
    s = datafile.readlines()

t = []
for x in s:
    t.append('.'.join(x.strip().split('.')[-2:]))  # strip() drops the trailing newline
print t
I am pretty new to python. I am trying to process data on a very large .csv file (~6.8 million lines). An example of the lines would look like:
Group1.1 57645 0.0954454545
Group1.1 57662 0.09556544778
Group1.13 500 0.357114538
Group1.13 504 0.320618298
Group1.13 2370 0.483851368
Group1.14 42 0.5495688
The first column gives the group, the second gives the position, and the third gives the value I am reading in to run a calculation on. I am trying to perform these calculations in a "sliding window" based on the position. Another factor is that each group is calculated separately, because the position numbers restart for each group. In my code I first read the group IDs into a list, "uniqify" that list, and then use it as a basis for performing the "sliding window" over one specific group at a time. I then move to the next group ID in the unique list and run the calculation again. Here are the basics of my code (the unique1 function is a simple method to uniqify a list):
for row in reader:
    scaffolds.append(row[0])
    unique1(scaffolds)
    newfile.seek(0)
    reader = csv.reader((line.replace('\0', '') for line in newfile), delimiter="\t")
    if row[0] == unique_scaffolds[i]:
        # ...perform the calculations
    else:
        i += 1
The problem I am running into is that it only reads the very first line of my data set and nothing more. So if I insert a "print row" right after the "for row in reader", I get output like this:
['Group1.1', '424', '0.082048032']
If I write this exact same code without any of the further calculations and loops following, it will print every single row in the data set. In this situation how would I read in every line at the beginning of this loop?
You are re-initializing reader on each pass through the loop; essentially, this is causing it to get stuck on the first line. Try this:
reader = csv.reader((line.replace('\0', '') for line in newfile), delimiter="\t")
for row in reader:
    scaffolds.append(row[0])
    unique1(scaffolds)
    newfile.seek(0)
    if row[0] == unique_scaffolds[i]:
        # ...perform the calculations
    else:
        i += 1
It looks to me like you're replacing your reader object inside the loop. Fix that (or get rid of it) and you'll probably have better luck getting this to work.
Realize that csv.reader only reads in one line at a time. You will have to build your own list by reading the rows in, one line at a time.
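For example, a minimal sketch that pulls every row into a list up front (data.csv is a placeholder name; the delimiter matches the tab-separated file from the question):
import csv

with open('data.csv') as f:
    reader = csv.reader(f, delimiter='\t')
    rows = list(reader)   # the whole file as a list of rows, reusable as often as you like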
I am new to programming in general and very new to Python. I have a csv list that is read into python, shuffled, and then written out in a different order.
In its current state, it prints out a long list of the items that are shuffled. I need to have them each printed onto a new row instead. To do that I understand I need to add \n somewhere, but I don't know where. Should I add it to the section of the code where the list is created in the first place, or where the csv file is written? I am guessing the latter, so here is the relevant code, but I can paste more if necessary:
make_list = csv.writer(open('026a_te.csv', 'wb'))
make_list.writerow(list_a)
Where do I add \n so that each element of list_a is written to a new row in the output file 026a_te.csv?
You're only writing a single row to the output csv, so naturally it's going to be all on one line.
You should have a list of lists to write using writerows instead of writerow, or you should be in a loop that calls writerow multiple times. The newlines will be added for you automatically.
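For example, a minimal sketch of both options, assuming each element of list_a should become its own row (if your elements are already lists, pass list_a straight to writerows):
import csv

make_list = csv.writer(open('026a_te.csv', 'wb'))  # Python 2; on Python 3 use open('026a_te.csv', 'w', newline='')

# option 1: one call that writes a row per element
make_list.writerows([item] for item in list_a)

# option 2: an explicit loop doing the same thing
# for item in list_a:
#     make_list.writerow([item])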
So let's say I'm using Python's ftplib to retrieve a list of log files from an FTP server. How would I parse that list of files to get just the file names (the last column) into a list? See the link above for example output.
Using retrlines() probably isn't the best idea there, since it just prints to the console and so you'd have to do tricky things to even get at that output. A likely better bet would be to use the nlst() method, which returns exactly what you want: a list of the file names.
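A minimal sketch of the nlst() route (the host name is a placeholder):
from ftplib import FTP

ftp = FTP('ftp.example.com')  # placeholder host
ftp.login()
names = ftp.nlst()            # a plain list of file names, no parsing needed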
The best answer:
You may want to use ftp.nlst() instead of ftp.retrlines(). It will give you exactly what you want.
If you can't, read the following:
Generators for sysadmin processes
In his now-famous tutorial, Generator Tricks for Systems Programmers: An Introduction, David M. Beazley gives a lot of recipes for answering this kind of data problem with quick and reusable code.
E.g.:
# empty list that will receive all the log entries
log = []

# we pass a callback to bypass the print_line that retrlines would otherwise call;
# we do that only because we cannot use something better than retrlines
ftp.retrlines('LIST', callback=log.append)

# we use rsplit because it's more efficient in our case if we have a big file
files = (line.rsplit(None, 1)[1] for line in log)

# get your file list
files_list = list(files)
Why don't we generate the list immediately?
Well, it's because doing it this way offers you a lot of flexibility: you can apply any intermediate generator to filter the files before turning them into files_list. It's just like a pipe: add a line and you add a processing step, without overhead (since they are generators). And if you get rid of retrlines, it still works, and it's even better because you never store the full list at all.
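For example, a filtering step slots in as just one more line of the pipe (the .log suffix is only for illustration):
files = (line.rsplit(None, 1)[1] for line in log)
logs_only = (name for name in files if name.endswith('.log'))  # hypothetical filter step
files_list = list(logs_only)   # nothing is stored until this line runs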
EDIT: well, I read the comments on the other answer, and they say that this won't work if there is a space in a name.
Cool, this will illustrate why this method is handy: if you want to change something in the process, you just change one line. Swap:
files = (line.rsplit(None, 1)[1] for line in log)
and
# split the line, take all the items from field 8 onwards (the name), then join them back together
files = (' '.join(line.split()[8:]) for line in log)
OK, this may not be obvious here, but for huge batch-processing scripts, it's nice :-)
And a slightly less-optimal method, by the way, if you're stuck using retrlines() for some reason, is to pass a function as the second argument to retrlines(); it'll be called for each item in the list. So something like this (assuming you have an FTP object named 'ftp') would work as well:
filenames = []
ftp.retrlines('LIST', lambda line: filenames.append(line.split()[-1]))
The list 'filenames' will then be a list of the file names.
Is there any reason why ftplib.FTP.nlst() won't work for you? I just checked and it returns only names of the files in a given directory.
Since every filename in the output starts at the same column, all you have to do is get the position of the dot on the first line:
drwxrwsr-x 5 ftp-usr pdmaint 1536 Mar 20 09:48 .
Then slice the filename out of the other lines using the position of that dot as the starting index.
Since the dot is the last character on the line, you can use the length of the line minus 1 as the index. So the final code is something like this:
# retrlines() returns a status string, not the listing, so collect the lines with a callback
lines = []
ftp.retrlines('LIST', lines.append)
filename_index = len(lines[0]) - 1  # position of the trailing "." on the first line
files = []
for line in lines:
    files.append(line[filename_index:])
If the FTP server supports the MLSD command, then please see section “single directory case” from that answer.
Use an instance (say ftpd) of the FTPDirectory class, call its .getdata method with a connected ftplib.FTP instance in the correct folder, and then you can:
directory_filenames = [ftpfile.name for ftpfile in ftpd.files]
I believe it should work for you.
file_name_list = [' '.join(each_file_detail.split()).split()[-1] for each_file_detail in file_list_from_log]
NOTES -
Here I am making an assumption that you want the data in the program (as a list), not on the console.
each_file_detail is each line that is being produced by the program.
' '.join(each_file_detail.split())
replaces multiple spaces with a single space.