Header of data file disappears when sorting

I have a csv file with rows of data. The first row is headers for the columns.
I'd like to sort the data by some parameter (specifically, the first column), but of course keep the header where it is.
When I do the following, the header disappears completely and is not included in the output file.
Can anyone please advise how to keep the header but skip it and sort the rest of the rows?
(FWIW, the first column is a mix of numbers and letters.)
Thanks!
Here's my code:
import csv
import operator
sankey = open('rawforsankey.csv', "rb")
raw_reader = csv.reader(sankey)
raw_data = []
for row in raw_reader:
    raw_data.append(row)
raw_data_sorted = sorted(raw_data, key=operator.itemgetter(0))
myfiletest = open('newfiletest.csv', 'wb')
wr = csv.writer(myfiletest,quoting = csv.QUOTE_ALL)
wr.writerows(raw_data_sorted)
sankey.close()
myfiletest.close()
EDIT: should mention I tried this variation in the code:
raw_data_sorted = sorted(raw_data[1:], key=operator.itemgetter(0))
but got the same result

You sorted all data, including the header, which means it is still there but perhaps in the middle of your resulting output somewhere.
This is how you'd sort a CSV on one column, preserving the header:
import csv
import operator
with open('rawforsankey.csv', "rb") as sankey:
    raw_reader = csv.reader(sankey)
    # Read the header row first so it is not part of the sort
    header = next(raw_reader, None)
    # The key function must be passed by keyword
    sorted_data = sorted(raw_reader, key=operator.itemgetter(0))
with open('newfiletest.csv', 'wb') as myfiletest:
    wr = csv.writer(myfiletest, quoting=csv.QUOTE_ALL)
    if header:
        wr.writerow(header)
    wr.writerows(sorted_data)
Just remember that the sort is lexicographic, because every column is read as a string; '10' sorts before '9', for example. Use a more specific sort key if your data is numeric.
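For Python 3, the same idea can be sketched as below. This is a minimal sketch rather than the answer's original code: the helper names natural_key and sort_csv_text are hypothetical, the data is an in-memory sample instead of rawforsankey.csv, and the natural-sort key is only one assumed way to order a first column that mixes numbers and letters.

```python
import csv
import io
import re


def natural_key(row):
    """Sort key for column 0: digit runs compare as numbers, the rest as text."""
    parts = re.split(r"(\d+)", row[0])
    # Tag each part so numbers and text are never compared with each other.
    return [(0, int(p), "") if p.isdecimal() else (1, 0, p.lower()) for p in parts]


def sort_csv_text(text):
    """Return CSV text with the header kept first and the data rows sorted."""
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader, None)  # pull the header off before sorting
    rows = sorted(reader, key=natural_key)
    out = io.StringIO(newline="")
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    if header:
        writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical sample data: a header plus three rows with mixed IDs.
sample = "id,name\nb10,x\nb9,y\na2,z\n"
print(sort_csv_text(sample))
```

With a real file, the io.StringIO objects would be replaced by open(path, newline='') for reading and open(path, 'w', newline='') for writing.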

Related

How do I sort data from a CSV by a column?

I need to organise a CSV file by user ID in ascending order. The csv file has a header that I would like to keep at the top of the document.
The headers are shown below; they are followed by 13500 rows of data.
User_ID;firstname;lastname;location
The code I have currently omits the headings. If I remove the heading=next(csv_reader) line, it puts the headings at the bottom of the document.
The current output is also not in the correct order: it sorts on the first character of the ID rather than the whole number (ID=13000 comes before ID=2000 through 9999).
import csv
import operator
file = open("file.csv", 'r')
csv_reader = csv.reader(file, delimiter=';')
heading=next(csv_reader)
sort = sorted(csv_reader, key=operator.itemgetter(0))
for eachline in sort:
    print(eachline)

Your current sort happens in lexical order, because the elements of your CSV file are strings. If you want to sort them as integers, have your key function in the sorted call convert them to integers.
sorted_data = sorted(csv_reader, key=lambda row: int(row[0]))
I used a lambda instead of operator.itemgetter(0) because we needed to convert to an int anyway, and this is the most convenient way to do so.
To print the header with the data, print it before printing your data:
print(heading)
for line in sorted_data:
    print(line)
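Put together, the approach might look like the following sketch. The sample rows are made up to stand in for file.csv; only the header line comes from the question.

```python
import csv
import io

# Hypothetical sample standing in for file.csv (same header as the question).
sample = (
    "User_ID;firstname;lastname;location\n"
    "13000;Ann;Lee;Oslo\n"
    "2000;Bob;Ray;Rome\n"
    "9999;Cy;Fox;Bonn\n"
)

reader = csv.reader(io.StringIO(sample), delimiter=";")
heading = next(reader)  # take the header out before sorting
# Convert the ID to int so the comparison is numeric, not lexical.
sorted_data = sorted(reader, key=lambda row: int(row[0]))

ids_in_order = [row[0] for row in sorted_data]
print(heading)
for line in sorted_data:
    print(line)
```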

You can achieve it with pandas too:
import pandas as pd
df = pd.read_csv('file.csv', delimiter=';')  # pass the path so pandas opens and closes the file
sorted_df = df.sort_values(by=["User_ID"], ascending=True)
sorted_df.to_csv('file_sorted.csv', sep=';', index=False)
print(sorted_df.to_string(index=False))

How to find max and min values within lists without using maps/SQL?

I'm learning Python and have a data set (a CSV file). I've been able to split the lines by comma, but now I need to find the max and min values in the third column and output the corresponding value in the first column of the same row.
This is the .csv file: https://www.dropbox.com/s/fj8tanwy1lr24yk/loan.csv?dl=0
I also can't use Pandas or any external libraries; I think it would have been easier if I used them
I have written this code so far:
f = open("loanData.csv", "r")
mylist = []
for line in f:
    mylist.append(line)
newdata = []
for row in mylist:
    data = row.split(",")
    newdata.append(data)

I'd use the built-in csv library for parsing your CSV file, and then just generate a list with the 3rd column values in it:
import csv
with open("loanData.csv", "r") as loanCsv:
    loanCsvReader = csv.reader(loanCsv)
    # Comment out if no headers
    next(loanCsvReader, None)
    # Convert to float (assumes the third column is numeric) so values
    # are compared as numbers rather than as strings
    loan_data = [float(row[2]) for row in loanCsvReader]
max_val = max(loan_data)
min_val = min(loan_data)
print("Max: {}".format(max_val))
print("Min: {}".format(min_val))
I don't know the details of your file or whether it has a header row. If it has none, comment out
next(loanCsvReader, None)
so that the first data row is not skipped.
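The question also asks for the first-column value on the same row as the max and min. One way to get it, sketched here with made-up data (the column names and values are assumptions, not the contents of loan.csv), is to pass a key function to max() and min() so that they return the whole row:

```python
import csv
import io

# Hypothetical data: column 0 is an ID, column 2 is the numeric value of interest.
sample = "loan_id,term,amount\nA1,36,2500.0\nB2,60,12000.5\nC3,36,900\n"

reader = csv.reader(io.StringIO(sample))
next(reader, None)  # skip the header row
rows = list(reader)


def third_column(row):
    """Numeric value of the third column, used as the comparison key."""
    return float(row[2])


max_row = max(rows, key=third_column)
min_row = min(rows, key=third_column)
print("Max: {} (first column: {})".format(max_row[2], max_row[0]))
print("Min: {} (first column: {})".format(min_row[2], min_row[0]))
```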

Something like this might work. The index would start at zero, so the third column should be 2.
# Use names other than min/max so the built-in functions are not shadowed
min_val = min([row.split(',')[2] for row in mylist])
max_val = max([row.split(',')[2] for row in mylist])
Separately, you could probably read and reformat your data to a list with the following:
with open('loanData.csv', 'r') as f:
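    data = f.read()
mylist = list(data.split('\n'))
This assumes each row ends with a newline (\n). When a file is opened in text mode with Python 3's default newline handling, \r\n line endings are translated to \n on reading, so the split also works for files written on Windows; data.splitlines() is a more robust alternative.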

Making Python ignore CSV separator instruction [duplicate]

I am asking Python to print the minimum number from a column of CSV data, but the top row is the column number, and I don't want Python to take the top row into account. How can I make sure Python ignores the first line?
This is the code so far:
import csv
with open('all16.csv', 'rb') as inf:
    incsv = csv.reader(inf)
    column = 1
    datatype = float
    data = (datatype(column) for row in incsv)
    least_value = min(data)
print least_value
Could you also explain what you are doing, not just give the code? I am very very new to Python and would like to make sure I understand everything.

You could use an instance of the csv module's Sniffer class to deduce the format of a CSV file and detect whether a header row is present along with the built-in next() function to skip over the first row only when necessary:
import csv
with open('all16.csv', 'r', newline='') as file:
    has_header = csv.Sniffer().has_header(file.read(1024))
    file.seek(0) # Rewind.
    reader = csv.reader(file)
    if has_header:
        next(reader) # Skip header row.
    column = 1
    datatype = float
    data = (datatype(row[column]) for row in reader)
    least_value = min(data)
print(least_value)
Since datatype and column are hardcoded in your example, it would be slightly faster to process the row like this:
data = (float(row[1]) for row in reader)
Note: the code above is for Python 3.x. For Python 2.x use the following line to open the file instead of what is shown:
with open('all16.csv', 'rb') as file:

To skip the first line just call:
next(inf)
Files in Python are iterators over lines.

Borrowed from the Python Cookbook, a more concise template might look like this:
import csv
with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)
    for row in f_csv:
        pass  # Process row ...

In a similar use case I had to skip annoying lines before the line with my actual column names. This solution worked nicely. Open the file, skip the unwanted line, then pass the file object to csv.DictReader.
import csv
with open('all16.csv') as tmp:
    # Skip first line (if any)
    next(tmp, None)
    # {line_num: row}
    data = dict(enumerate(csv.DictReader(tmp)))

You would normally use next(incsv), which advances the iterator one row, so you skip the header. The other option (say you wanted to skip 30 rows) would be:
from itertools import islice
for row in islice(incsv, 30, None):
    pass  # process
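A small self-contained illustration of the islice approach, using a generated in-memory sample (the data and the skip count of 3 are assumptions for the demo):

```python
import csv
import io
from itertools import islice

# Hypothetical sample: one header line followed by row0..row4.
sample = "header\n" + "\n".join("row{}".format(i) for i in range(5)) + "\n"

reader = csv.reader(io.StringIO(sample))
skip = 3  # number of leading rows to drop, header included
# islice(reader, skip, None) consumes the first `skip` rows and yields the rest.
remaining = [row[0] for row in islice(reader, skip, None)]
print(remaining)
```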

Use csv.DictReader instead of csv.reader.
If the fieldnames parameter is omitted, the values in the first row of the CSV file will be used as field names. You would then be able to access field values using row["1"], etc.
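As a rough sketch of that suggestion, assuming the top row really does hold column numbers such as 1 and 2 (the sample values below are invented):

```python
import csv
import io

# Hypothetical sample: the top row holds the column numbers.
sample = "1,2\n4.5,10\n0.25,7\n3,8\n"

reader = csv.DictReader(io.StringIO(sample))
# The header row is consumed as field names, so only data rows are iterated.
least_value = min(float(row["1"]) for row in reader)
print(least_value)
```

Note that the keys are strings ("1", not 1), because DictReader takes the field names verbatim from the header row.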

Python 2.x
csvreader.next()
Return the next row of the reader’s iterable object as a list, parsed
according to the current dialect.
csv_data = csv.reader(open('sample.csv'))
csv_data.next() # skip first row
for row in csv_data:
    print(row) # prints every row after the first
Python 3.x
csvreader.__next__()
Return the next row of the reader’s iterable object as a list (if the
object was returned from reader()) or a dict (if it is a DictReader
instance), parsed according to the current dialect. Usually you should
call this as next(reader).
csv_data = csv.reader(open('sample.csv'))
csv_data.__next__() # skip first row
for row in csv_data:
    print(row) # prints every row after the first

The documentation for the Python 3 CSV module provides this example:
with open('example.csv', newline='') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    # ... process CSV file contents here ...
The Sniffer will try to auto-detect many things about the CSV file. You need to explicitly call its has_header() method, passing it a sample of the file's text, to determine whether the file has a header line. If it does, then skip the first row when iterating the CSV rows. You can do it like this:
with open('example.csv', newline='') as csvfile:
    sample = csvfile.read(1024)
    csvfile.seek(0)
    reader = csv.reader(csvfile, csv.Sniffer().sniff(sample))
    if csv.Sniffer().has_header(sample):
        next(reader, None)  # skip the header row
    for data_row in reader:
        pass  # do something with the row

This might be a very old question, but with pandas there is a very easy solution:
import pandas as pd
data=pd.read_csv('all16.csv',skiprows=1)
data['column'].min()
With skiprows=1, pandas skips the first line of the file and uses the next line as the column header. data['column'].min() then returns the least value in the column named 'column'.

The pandas package might be more relevant than csv. The code below reads a CSV file, by default interpreting the first line as the column header, and finds the minimum of each column.
import pandas as pd
data = pd.read_csv('all16.csv')
data.min()

Because this is related to something I was doing, I'll share here.
What if you're not sure whether there's a header and you also don't feel like importing Sniffer and other things?
If your task is basic, such as printing or appending to a list or array, you could just use an if statement:
import csv
array = []
# Let's say there are 4 columns
with open('file.csv') as csvfile:
    csvreader = csv.reader(csvfile)
    # Read the first line
    first_line = next(csvreader)
    # My headers were just text. You can use any suitable conditional here
    if len(first_line) == 4:
        array.append(first_line)
    # Now we'll just iterate over everything else as usual:
    for row in csvreader:
        array.append(row)

Well, my mini wrapper library would do the job as well.
>>> import pyexcel as pe
>>> data = pe.load('all16.csv', name_columns_by_row=0)
>>> min(data.column[1])
Meanwhile, if you know the header of the column at index 1, for example "Column 1", you can do this instead:
>>> min(data.column["Column 1"])

For me the easiest way to go is to use range.
import csv
with open('files/filename.csv') as I:
    reader = csv.reader(I)
    fulllist = list(reader)
# Starting with data skipping header
for item in range(1, len(fulllist)):
    # Print each row using "item" as the index value
    print (fulllist[item])

I would convert csvreader to list, then pop the first element
import csv
with open(fileName, 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    data = list(csvreader) # Convert to list
    data.pop(0) # Removes the first row
    for row in data:
        print(row)

I would use tail to get rid of the unwanted first line:
tail -n +2 $INFIL | whatever_script.py

Just add [1:].
Example below:
data = pd.read_csv("/Users/xyz/Desktop/xyxData/xyz.csv", sep=',', header=None)[1:]
That works for me in IPython.

Python 3.X
Handles UTF8 BOM + HEADER
It was quite frustrating that the csv module could not easily get the header. There is also a bug with the UTF-8 BOM (the first character in the file).
This works for me using only the csv module:
import csv
def read_csv(self, csv_path, delimiter):
    with open(csv_path, newline='', encoding='utf-8') as f:
        # https://bugs.python.org/issue7185
        # Remove UTF8 BOM (assumes the file always starts with one).
        txt = f.read()[1:]
    # Remove header line.
    header = txt.splitlines()[:1]
    lines = txt.splitlines()[1:]
    # Convert to list.
    csv_rows = list(csv.reader(lines, delimiter=delimiter))
    for row in csv_rows:
        value = row[INDEX_HERE]
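An alternative that avoids slicing off the first character by hand is Python's utf-8-sig codec, which drops a leading BOM only when one is present. The sketch below is not the answer's method; it uses in-memory bytes as a hypothetical stand-in for the file.

```python
import csv
import io

# Hypothetical file contents: UTF-8 BOM, a header line, then two data rows.
raw = b"\xef\xbb\xbfid,name\r\n1,Ann\r\n2,Bob\r\n"

# TextIOWrapper here mimics open(path, newline='', encoding='utf-8-sig').
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)  # the codec has already removed the BOM
    rows = list(reader)

print(header)
print(rows)
```

Because the file object is handed to csv.reader directly, quoted fields that contain line breaks are also parsed correctly, which splitting the text into lines first would not guarantee.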

A simple solution is to use csv.DictReader():
import csv
def read_csv(file):
    with open(file, 'r') as file:
        reader = csv.DictReader(file)
        for row in reader:
            print(row["column_name"]) # Replace the name of column header.

Format csv data and write each row to a json

I'm trying to write each row of a csv to a json (this will then be posted and looped back through, so overwriting the json file is not a big deal here). I have code which seems to do this well enough, but I also need some of the data to be floats/integers rather than strings.
I have a method which works for this in other places, but cannot manage to get the two to agree with each other.
Could anyone point me in the right direction to be able to format the csv data before sending it out as a json? Below is the code for when headers are left in, though I also have a tweaked version which just has raw data in the csv and uses fieldnames for the headers instead.
import csv
import json
input_file = 'Test3.csv'
output_file_template = 'Test.json'
with open(input_file, 'r', encoding='utf8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter=',')
    rows = list(reader)
for i in range(len(rows)):
    out = json.dumps(rows[1*i:1*(i+1)])
    with open(output_file_template.format(i), 'w') as f:
        f.write(out)
Data is in a format like this:
OrderType  OrderStatus  OrderDateTime  SettlementDate  MarketId  OrderRoute
Sale       Executed     18/11/2016     23/11/2016      1         None
Sale       Executed     18/11/2016     23/11/2016      1         None
Sale       Executed     18/11/2016     23/11/2016      1         None
With row[4] producing the key error.

In your loop if the float/int data is consistently in the same spot, you can simply cast the values.
for i, row in enumerate(rows):
    row[0] = int(row[0]) # this column stores ints
    row[1] = float(row[1]) # this column stores floats
    out = json.dumps([row])
    with open(output_file_template.format(i), 'w') as f:
        f.write(out)
I don't know if columns 0 and 1 hold ints and floats, but you can change that as necessary.
Update:
It appears row is an OrderedDict, so you'll just need to use the key instead of an index:
row['MarketId'] = int(row['MarketId'])
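A fuller sketch of the key-based conversion follows. The converters mapping is a hypothetical helper, not part of the original code, and the sample uses comma-separated text with the question's column names; the JSON strings are collected in a list instead of being written to files.

```python
import csv
import io
import json

# Hypothetical comma-separated sample using the column names from the question.
sample = (
    "OrderType,OrderStatus,OrderDateTime,SettlementDate,MarketId,OrderRoute\n"
    "Sale,Executed,18/11/2016,23/11/2016,1,None\n"
    "Sale,Executed,18/11/2016,23/11/2016,2,None\n"
)

# Assumed mapping of column name -> converter; adjust it to the real schema.
converters = {"MarketId": int}

payloads = []
for row in csv.DictReader(io.StringIO(sample)):
    record = dict(row)
    for name, convert in converters.items():
        record[name] = convert(record[name])  # cast by key, not by index
    payloads.append(json.dumps([record]))  # one JSON document per CSV row

print(payloads[0])
```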

Python: General CSV file parsing and manipulation

The purpose of my Python script is to compare the data present in multiple CSV files, looking for discrepancies. The data are ordered, but the ordering differs between files. The files contain about 70K lines, weighing around 15MB. Nothing fancy or hardcore here. Here's part of the code:
def getCSV(fpath):
    with open(fpath,"rb") as f:
        csvfile = csv.reader(f)
        for row in csvfile:
            allRows.append(row)
    allCols = map(list, zip(*allRows))
Am I properly reading from my CSV files? I'm using csv.reader, but would I benefit from using csv.DictReader?
How can I create a list containing whole rows which have a certain value in a precise column?

Are you sure you want to be keeping all rows around? This creates a list with matching values only... fname could also come from glob.glob() or os.listdir() or whatever other data source you so choose. Just to note, you mention the 20th column, but row[20] will be the 21st column...
import csv
matching20 = []
for fname in ('file1.csv', 'file2.csv', 'file3.csv'):
    with open(fname) as fin:
        csvin = csv.reader(fin)
        next(csvin) # <--- if you want to skip header row
        for row in csvin:
            if row[20] == 'value':
                matching20.append(row) # or do something with it here
You only want csv.DictReader if you have a header row and want to access your columns by name.
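The filtering step can be wrapped in a small function, as in this sketch. The name rows_matching and its parameters are hypothetical, and it reads from a string so that the example is self-contained; a file object from open(fname, newline='') could be passed to csv.reader in the same way.

```python
import csv
import io


def rows_matching(csv_text, column, value, skip_header=True):
    """Return the rows whose field at index `column` equals `value`."""
    reader = csv.reader(io.StringIO(csv_text))
    if skip_header:
        next(reader, None)
    # Guard on len(row) so short or blank rows do not raise IndexError.
    return [row for row in reader if len(row) > column and row[column] == value]


# Hypothetical sample with a trailing blank line.
sample = "id,status\n1,ok\n2,bad\n3,ok\n\n"
matches = rows_matching(sample, column=1, value="ok")
print(matches)
```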

This should work, you don't need to make another list to have access to the columns.
import csv
import sys
def getCSV(fpath):
    with open(fpath) as ifile:
        csvfile = csv.reader(ifile)
        rows = list(csvfile)
    value_20 = [x for x in rows if x[20] == 'value']
    return value_20

If I understand the question correctly, you want to include a row if value is in the row, but you don't know which column value is, correct?
If your rows are lists, then this should work:
testlist = [row for row in allRows if 'value' in row]
post-edit:
If, as you say, you want a list of rows where value is in a specified column (specified by an integer pos), then:
pos = 20
# Keep only the rows whose element at index pos equals 'value'
testlist = [row for row in allRows if row[pos] == 'value']
(I haven't tested this, but let me know if that works.)
