How to sort the nth column in a text file - Python

So my data looks like this:
1 3456542 5 may 2014
2 1245678 4 may 2014
3 4256876 2 may 2014
4 5643156 6 may 2014
.....
I want to sort by the 2nd column (the 7-digit ID numbers) from greatest to least. Also, depending on the first digit of the ID number, I'd like to send each row to a different text file (i.e., for all the ID numbers that start with 3, send that entire row to one text file; for all the ID numbers that start with 1, send that entire row to another text file; and so on). What is the easiest way to accomplish something like this?

You could try using pandas. That makes it really easy.
import pandas as pd
import sys

# StringIO moved between Python 2 and 3
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

txt = StringIO('''
a b c d e
1 3456542 5 may 2014
2 1245678 4 may 2014
3 4256876 2 may 2014
4 5643156 6 may 2014
''')
df = pd.read_csv(txt, delim_whitespace=True)
df.sort_values('b', ascending=False)  # sort() in very old pandas versions

Assuming that your input data is text, I would start by separating the lines from each other and then the columns within each line; see the str.split() method for this.
The result should be a list of lists. You can then sort by the second column with the sort() method or the sorted() function if you provide the key= keyword argument. You may have to convert the number column to int so that it sorts numerically (and not alphabetically).
For the last part of your question, you could use itertools.groupby(), which provides the grouping functionality you requested.
This should get you started; a minimal sketch follows below. Another option would be to use pandas.
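A minimal sketch of those steps; the file names are assumptions:
import itertools

with open("data.txt") as f:
    rows = [line.split() for line in f if line.strip()]

# convert the ID column to int so the sort is numeric, greatest to least
rows.sort(key=lambda row: int(row[1]), reverse=True)

# itertools.groupby() needs its input sorted by the grouping key,
# here the first digit of the ID
by_digit = sorted(rows, key=lambda row: row[1][0])
for digit, group in itertools.groupby(by_digit, key=lambda row: row[1][0]):
    with open("ids_starting_with_{0}.txt".format(digit), "w") as out:
        for row in group:
            out.write(" ".join(row) + "\n")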

"I wasn't asking for an answer, I was asking where to start conceptually."
Start by reading the text file with file.readlines(), then split the data using line.strip().split(" ", 2), which will give you data in the following format:
['1', '3456542', '5 may 2014']
Now you should be able to complete your task.
Hint: look up the built-in functions int() and sorted().
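For instance, a minimal sketch (the file name is an assumption):
# a minimal sketch, assuming the data lives in data.txt
with open("data.txt") as f:
    rows = [line.strip().split(" ", 2) for line in f.readlines()]

# int() makes the sort numeric; reverse=True gives greatest to least
rows = sorted(rows, key=lambda row: int(row[1]), reverse=True)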

Here's my way of doing it:
import csv

# read in the file
file_lines = []
with open("test.txt", "r") as csv_file:
    reader = csv.reader(csv_file, delimiter=" ")
    for row in reader:
        file_lines.append(row)

# sort numerically on the ID column, greatest to least
# (sorting on the raw strings would order them alphabetically)
file_lines.sort(key=lambda row: int(row[1]), reverse=True)

# write the sorted file
with open("test_sorted.txt", "w") as csv_file:
    writer = csv.writer(csv_file, delimiter=" ")
    for row in file_lines:
        writer.writerow(row)

# separate files: append so that rows sharing a leading digit
# accumulate in the same file instead of overwriting each other
for row in file_lines:
    file_num = row[1][0]
    with open("file_{0}.txt".format(file_num), "a") as f:
        writer = csv.writer(f, delimiter=" ")
        writer.writerow(row)

Related

Extract a column from a CSV file in which a few rows have extra commas in a value (the address field), which breaks the column count

I need to access values of a column that occurs after the address column, but the presence of commas in the address field causes extra columns to be counted.
Example csv:
id,name,place,address,age,type,dob,date
1,Murtaza,someplace,Street,MA,22,B,somedate,somedate,
2,Murtaza,someplace,somestreet,45,C,somedate,somedate,
3,Murtaza,someplace,somestreet,MA,44,V,somedate,somedate
Excel output:
id name place address age type dob date newcolumn9
1 Murtaza someplace somestreet MA 22 B somedate somedate
2 Murtaza someplace somestreet 45 C somedate somedate
3 Murtaza someplace somestreet MA 44 V somedate somedate
This is what I tried:
# I was able to see that all columns before the column with extra commas displayed fine using this code.
import pandas as pd
import csv

with open('Myfile', 'rb') as f, open('Newfile', 'wb') as g:
    writer = csv.writer(g, delimiter=',')
    for line in f:
        row = line.split(',', 2)
        writer.writerow(row)
I am trying to do this in python pandas.
If I can parse the csv in reverse, I'll be able to get the proper values regardless of the error.
From the above example, I want to extract the age column.
pandas, or simply re.split():
import re

your_csv_file = open('your_csv_file.csv', 'r').read()
i_column = 2  # index of the desired column, counted from the end
lines = re.split('\n', your_csv_file)[:-1]  # drop the last (empty) line, if any
your_column = []
for line in lines:
    # the minus sign makes the indexing start from the end of the row
    your_column.append(re.split(',', line)[-i_column])
print(your_column)
executed on a .csv-file like the one below
4rth,askj,fpou,ABC,aekert
kjgf,poiuf,pejhh,,oeiu,DEF,akdhg
iuzrit,fslgk,gth,,rhf,,rhe,GHI,ozug
pwiuto,,,,eflgjkhrlguiazg,JKL,rgj
this returns
['ABC', 'DEF', 'GHI', 'JKL']
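The same count-from-the-back indexing works with plain str.split(), no regex required; a minimal sketch under the same file-name assumption:
i_column = 2  # index of the desired column, counted from the end
your_column = []
with open('your_csv_file.csv') as f:
    for line in f:
        if line.strip():  # skip a trailing empty line, if any
            your_column.append(line.strip().split(',')[-i_column])
print(your_column)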
I think the best way to do this might be to write a separate script that removes the faulty commas. But if you just want to ignore the faulty lines, you can read each good line into a StringIO and skip any line with the incorrect number of commas. So, if you're expecting 4 columns:
from cStringIO import StringIO
import pandas

s = StringIO()
correct_columns = 4
with open('MyData.csv') as file:
    for line in file:
        # note: line.split(','), not ','.split(line)
        if len(line.split(',')) == correct_columns:
            s.write(line)
s.seek(0)
pandas.read_csv(s)

Loop within loop when comparing csv files in Python

I have two csv files. I am trying to look up a value from the first column of one file (file 1) in the first column of the other file (file 2). If they match, then print the row from file 2.
Pseudo code:
read file1.csv
read file2.csv
loop through file1
    compare each row with each row of file2 in turn
    if file1[0] == file2[0]:
        print row of file2
file1:
45,John
46,Fred
47,Bill
File2:
46,Roger
48,Pete
49,Bob
I want it to print:
46 Roger
EDIT - these are examples; the actual files are much bigger (5,000 rows, 7 columns)
I have the following:
import csv

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1reader = csv.reader(csvfile1)
    csv2reader = csv.reader(csvfile2)
    for rowcsv1 in csv1reader:
        for rowcsv2 in csv2reader:
            if rowcsv1[0] == rowcsv2[0]:
                print(rowcsv1)
However I am getting no output.
I am aware there are other ways of doing it (with a dict, or pandas) but I am keen to know why my approach is not working.
EDIT: I now see that it is only iterating through the first row of file 1 and then closing, but I am unclear how to stop it closing (I also understand that this is not the best way to do it).
You open csv2reader = csv.reader(csvfile2) and then iterate through it against the first row of csv1reader; by then it has reached end of file and will not produce any more data.
So for the second through last rows of csv1reader you are comparing against an empty sequence, i.e. no comparison takes place.
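If you want to keep the nested loops, rewinding the second file before each pass of the inner loop would make them work, at the cost of re-reading file 2 once per row of file 1. A minimal sketch:
import csv

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    for rowcsv1 in csv.reader(csvfile1):
        csvfile2.seek(0)  # rewind so a fresh reader can run over file 2 again
        for rowcsv2 in csv.reader(csvfile2):
            if rowcsv1[0] == rowcsv2[0]:
                print(rowcsv2)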
In any case, this is a very inefficient method; unless you are working on very large files, it would be much better to do
import csv

# load the second file as a lookup table
data = {}
with open("csv2file.csv") as inf2:
    for row in csv.reader(inf2):
        data[row[0]] = row

# now process the first file against it
with open("csv1file.csv") as inf1:
    for row in csv.reader(inf1):
        if row[0] in data:
            print(data[row[0]])
See Hugh Bothwell's answer for why your code isn't working. For a fast way of doing what you stated you want to do in your question, try this:
import csv

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1 = list(csv.reader(csvfile1))
    csv2 = list(csv.reader(csvfile2))

duplicates = {a[0] for a in csv1} & {a[0] for a in csv2}
for row in csv2:
    if row[0] in duplicates:
        print(row)
It gets the keys that appear in both csv files, then loops through the second csv file, printing the row if the value at index 0 is also in the first csv file. This is a much faster algorithm than the one you were attempting.
If order matters, as @hugh-bothwell mentioned in @will-da-silva's answer, you could do:
import csv
from collections import OrderedDict

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1 = list(csv.reader(csvfile1))
    csv2 = list(csv.reader(csvfile2))

d = {row[0]: row for row in csv2}
keys = OrderedDict.fromkeys([a[0] for a in csv1]).keys()
duplicate_keys = [k for k in keys if k in d]
for k in duplicate_keys:
    print(d[k])
I'm pretty sure there's a better way to do this, but try out this solution; it should work.
counter = 0
import csv

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1reader = csv.reader(csvfile1)
    csv2rows = list(csv.reader(csvfile2))  # materialize so the inner loop can run more than once
    for rowcsv1 in csv1reader:
        for rowcsv2 in csv2rows:
            if rowcsv1[counter] == rowcsv2[counter]:
                print(rowcsv1)
counter += 1  # increment it outside of the IF statement

Remove rows that contain empty cells from a CSV using Python

I am splitting a CSV file into separate files based on a column with dates. However, some rows contain a date while the other cells are empty. I want to remove these rows with empty cells from the CSV, but I'm not sure how to do this.
Here is my code:
import collections
import csv
import sys

csv.field_size_limit(sys.maxsize)

with open(main_file, "r") as fp:
    root = csv.reader(fp, delimiter='\t', quotechar='"')
    result = collections.defaultdict(list)
    next(root)
    for row in root:
        year = row[0].split("-")[0]
        result[year].append(row)

for i, j in result.items():
    row_count = sum(1 for row in j)
    print(row_count)
    file_path = "%s%s-%s.csv" % (src_path, i, row_count)
    with open(file_path, 'w') as fp:
        writer = csv.writer(fp, delimiter='\t', quotechar='"')
        writer.writerows(j)
Pandas is perfect for this, especially if you want the code to be easily adjustable to, say, other file formats. Of course, one could consider it overkill.
To just remove rows with empty cells:
>>> import pandas as pd
>>> data = pd.read_csv('example.csv', sep='\t')
>>> print data
A B C
0 1 2 5
1 NaN 1 9
2 3 4 4
>>> data.dropna()
A B C
0 1 2 5
2 3 4 4
>>> data.dropna().to_csv('example_clean.csv')
I leave the splitting and saving into separate files using pandas as an exercise to start learning this great package, if you want :) A sketch follows below.
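For completeness, a minimal sketch of that exercise; the 'date' column name and the YYYY-MM-DD format are assumptions about your data:
import pandas as pd

data = pd.read_csv('example.csv', sep='\t').dropna()
# assumed: a 'date' column in YYYY-MM-DD form; adjust to the real layout
data['year'] = data['date'].str.split('-').str[0]
for year, group in data.groupby('year'):
    group.drop(columns='year').to_csv('{0}.csv'.format(year), sep='\t', index=False)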
This would skip all rows with at least one empty cell:
with open(main_file, "r") as fp:
    ....
    for row in root:
        if not all(map(len, row)):
            continue
Pandas is the best option in Python for handling any type of data processing. For help, you can go through this link: http://pandas.pydata.org/pandas-docs/stable/10min.html

Remove rows by keyword in a column, then remove all but the first column and save as text in Python

This is kind of confusing I suppose, but I have a CSV with 3 columns,
Example:
name, product, type
John, car, new
Jim, truck, used
Jack, minivan, new
Jane, SUV, used
Jeff, car, used
First, I want to go through the CSV and remove all rows except those marked "new". Once that is done, I want to remove all but the first column, and then save the result as a text file.
The code I have so far...
import csv

input_file = 'example.csv'
output_file = 'namesonly.txt'
reader = csv.reader(open(input_file, 'rb'), delimiter=',')
for line in reader:
    if "new" in line:
        print line
With the code I have it prints just what I want:
John, car, new
Jack, minivan, new
Now that I have just the customers who bought "new" vehicles, I want to cut the 2 columns on the right, leaving just the list of names, and then save that list to a .txt file. This is where I am getting stuck; I don't know how to proceed from here.
This is no problem. Look at the following.
f = open('namesonly.txt', 'w')
...
for line in reader:
    if "new" in line[2]:
        # line = line.split(',')  # <- not needed; csv.reader has already split the row for you
        f.write(line[0] + '\n')  # write the first field, i.e. the name
f.close()
This is untested, but something similar should work.
import csv

with open('example.csv') as infile, open('namesonly.txt', 'w') as outfile:
    for name, _prod, condition in csv.reader(infile):
        if condition.strip().lower() != 'new':  # keep only rows marked "new"
            continue
        outfile.write(name)
        outfile.write('\n')
While all the approaches given so far work, they are naive and will perform poorly on large CSV files. They also require you to work with the CSV files "manually" and write explicit for loops.
Whenever you see CSV files, you should think of two options: SQLite or Python pandas.
SQLite is built into your Python already. It uses SQL, so you need to learn some SQL.
Pandas uses a more Pythonic API to do the things you want to do; it is not included in the standard library (but it should not be complicated to install either). A sketch of the SQLite route is below; pandas follows after it.
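For reference, a minimal sketch of the SQLite route; the table name and file name are assumptions:
import csv
import sqlite3

conn = sqlite3.connect(':memory:')  # or a file path for a persistent database
conn.execute('CREATE TABLE sales (name TEXT, product TEXT, type TEXT)')
with open('example.csv') as f:
    reader = csv.reader(f, skipinitialspace=True)
    next(reader)  # skip the header row
    conn.executemany('INSERT INTO sales VALUES (?, ?, ?)', reader)
names = [row[0] for row in conn.execute("SELECT name FROM sales WHERE type = 'new'")]
print(names)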
Here is how to do what you want with pandas:
In [1]: import pandas as pd
In [2]: df = pd.read_csv('example.csv')
Get all the names (the first column):
In [3]: df['name']
Out[3]:
0 John
1 Jim
2 Jack
3 Jane
4 Jeff
Name: name, dtype: object
Find all new products:
In [18]: df[df[' type'] == ' new']
Out[18]:
name product type
0 John car new
2 Jack minivan new
You can assign the result, and then save it to a csv file.
In [19]: res = df[df[' type'] == ' new']
Pressing Tab shows all the export methods available on the result:
In [20]: res.to<TAB>
res.to_clipboard res.to_dict res.to_hdf res.to_latex res.to_period res.to_sparse res.to_string
res.to_csv res.to_excel res.to_html res.to_msgpack res.to_pickle res.to_sql res.to_timestamp
res.to_dense res.to_gbq res.to_json res.to_panel res.to_records res.to_stata res.to_wide
In [20]: res.to_c<TAB>
res.to_clipboard res.to_csv
In [20]: res.to_csv('new_products.csv')
Also note that pandas can handle CSV files very efficiently, since its parser is written in C.
About loading CSV with pandas
The CSV reader has tons of options. Check them out! I loaded the file naively, hence the space in the column names. If you think it's ugly, I would agree. You can pass the following keyword to fix the situation:
df = pd.read_csv('example.csv', skipinitialspace=True)
To show how simple pandas is
If you really want the names of those who have new products, as in Padraic Cunningham's answer, you can simply chain methods:
In [46]: df[df['type'] == 'new'].name
Out[46]:
0 John
2 Jack
Name: name, dtype: object
In [47]: df[df['type'] == 'new'].name.to_csv('out.csv')
Just unpack using a generator expression and keep the name when the row's type entry is equal to new:
import csv

with open("in.csv") as f, open("out.csv", "w") as out:
    wr = csv.writer(out)
    wr.writerows((name,) for name, _, tpe in csv.reader(f) if tpe == "new")
in.csv:
name,product,type
John,car,new
Jim,truck,used
Jack,minivan,new
Jane,SUV,used
Jeff,car,used
out.csv:
John
Jack

Merge 2 csv files with one unique column but different headers [duplicate]

This question already has answers here:
Merging two CSV files using Python
(2 answers)
Closed 7 years ago.
I want to merge 2 csv files using some scripting language (like bash script or python).
1st.csv (this data is from a MySQL query)
member_id,name,email,desc
03141,ej,ej#domain.com,cool
00002,jes,jes#domain.com,good
00002,charmie,charm#domain.com,sweet
2nd.csv (from a MongoDB query)
id,address,create_date
00002,someCity,20150825
00003,newCity,20140102
11111,,20150808
The examples are not the actual data, though I know that some of the member_id values from MySQL and the id values from MongoDB are the same.
(And I wish my output to be something like this:)
desiredoutput.csv
member_id,name,email,desc,address,create_date
03141,ej,ej#domain.com,cool,,
00002,jes,jes#domain.com,good,someCity,20150825
00002,charmie,charm#domain.com,sweet,
11111,,,,20150808
Help will be much appreciated. Thanks in advance.
#########################################################################
#!/usr/bin/python
import csv
import itertools as IT

filenames = ['1st.csv', '2nd.csv']
handles = [open(filename, 'rb') for filename in filenames]
readers = [csv.reader(f, delimiter=',') for f in handles]

with open('desiredoutput.csv', 'wb') as h:
    writer = csv.writer(h, delimiter=',', lineterminator='\n')
    for rows in IT.izip_longest(*readers, fillvalue=['']*2):
        combined_row = []
        for row in rows:
            row = row[:1]  # the column where I know there is identical data
            if len(row) == 1:
                combined_row.extend(row)
            else:
                combined_row.extend(['']*1)
        writer.writerow(combined_row)

for f in handles:
    f.close()
#########################################################################
I just read and tried (and adapted) this code from this site too.
Since you haven't posted an attempt, I'll give you a general answer (using Python) to get you started.
Create a dict, d
Iterate over all the rows of the first file, convert each row into a list and store it in d using member_id as the key and the list as the value.
Iterate over all the rows of the second file, convert each row into a list leaving out the id column and update the list under d[id] with the new list if d[id] exists, otherwise store the new list under d[id].
Finally, iterate over the values in d and print them out comma separated to a file.
Edit
In your attempt, you are trying to use izip_longest to iterate over the rows of both files at the same time. But this would work only if there were an equal number of rows in both files and they were in the same order.
Anyhow, here is one way of doing it.
Note: This is using the Python 3.4+ csv module. For 2.7 it might look a little different.
import csv

d = {}
with open("file1.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        # pad each row with empty address and create_date columns
        d.setdefault(row[0], []).append(row + [""] * 2)
with open("file2.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        # update every stored row with this id, or start a blank one
        old_rows = d.setdefault(row[0], [[row[0], "", "", ""]])
        for old_row in old_rows:
            old_row[4:] = row[1:]
with open("out.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for rows in d.values():
        writer.writerows(rows)
Here goes a suggestion using pandas, which I got from this answer and the pandas documentation about merging.
import pandas as pd
first = pd.read_csv('1st.csv')
second = pd.read_csv('2nd.csv')
merged = pd.concat([first, second], axis=1)
This will output:
member_id name email desc id address create_date
3141 ej ej#domain.com cool 2 someCity 20150825
2 jes jes#domain.com good 11 newCity 20140102
11 charmie charm#domain.com sweet 11111 NaN 20150808
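Note that concat(..., axis=1) pastes the frames together by row position, not by id, which is why the ids above do not line up. If the goal is to match member_id against id, a key-based merge comes closer to the desired output; a minimal sketch:
import pandas as pd

first = pd.read_csv('1st.csv')
second = pd.read_csv('2nd.csv')
# an outer merge keeps ids that appear in only one of the files
merged = pd.merge(first, second, left_on='member_id', right_on='id', how='outer')
merged.to_csv('desiredoutput.csv', index=False)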
