Convert numerical data in csv from bytes to kilobytes using Python - python

What I have at the moment is code that takes in a CSV file and determines the relevant data between a given start time and end time. I write this relevant data into a different CSV file. All of this works correctly.
What I want to do is convert all the numerical data (not touching the date or time) from the original CSV file from bytes into kilobytes and keep only one decimal place when presenting the kilobyte value. This altered numerical data is what I want written into the new CSV file.
The numerical data seems to be read as strings, so I'm a little unsure how to do this. Any help would be appreciated.
The original CSV (when opened in excel) is presented like this:
Date:-------- | Title1:----- | Title2: | Title3: | Title4:
01/01/2016 | 32517293 | 45673 | 0.453 |263749
01/01/2016 | 32721993 | 65673 | 0.563 |162919
01/01/2016 | 33617293 | 25673 | 0.853 |463723
But I want the new CSV to look something like this:
Date:-------- | Title1:--- | Title2: | Title3: | Title4:
01/01/2016 | 32517.2 | 45673 | 0.0 | 263.749
01/01/2016 | 32721.9 | 65673 | 0.0 | 162.919
01/01/2016 | 33617.2 | 25673 | 0.0 | 463.723
My Python function so far:
def edit_csv_file(Name,Start,End):
    #Open file to be written to
    f_writ = open(logs_folder+csv_file_name, 'a')
    #Open file to read from (i.e. the raw csv data from the windows machine)
    csvReader = csv.reader(open(logs_folder+edited_csv_file_name,'rb'))
    #Remove double quotation marks when writing new file
    writer = csv.writer(f_writ,lineterminator='\n', quotechar = '"')
    for row in csvReader:
        #Write the data relating to the modules greater than 10 seconds
        if get_sec(row[0][11:19]) >= get_sec(Start):
            if get_sec(row[0][11:19]) <= get_sec(End):
                writer.writerow(row)
    f_writ.close()

The following should do what you need:
import csv
with open('input.csv', 'rb') as f_input, open('output.csv', 'wb') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    csv_output.writerow(next(csv_input)) # write header
    for cols in csv_input:
        for col in range(1, len(cols)):
            try:
                cols[col] = "{:.1f}".format(float(cols[col]) / 1024.0)
            except ValueError:
                pass
        csv_output.writerow(cols)
Giving you the following output csv file:
Date:--------,Title1:-----,Title2:,Title3:,Title4:
01/01/2016,31755.2,44.6,0.0,257.6
01/01/2016,31955.1,64.1,0.0,159.1
01/01/2016,32829.4,25.1,0.0,452.9
Tested using Python 2.7.9
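The code above uses the binary file modes that the Python 2 csv module expects. Under Python 3 the same idea might look like the sketch below. The helper name bytes_to_kb_rows and the in-memory sample are assumptions made for illustration, and 1 KB is taken as 1024 bytes, as in the answer.

```python
import csv
import io


def bytes_to_kb_rows(rows):
    """Yield rows with every numeric cell after the first converted to KB."""
    for row in rows:
        if not row:
            continue  # skip blank lines
        converted = [row[0]]  # leave the date/time column untouched
        for cell in row[1:]:
            try:
                converted.append("{:.1f}".format(float(cell) / 1024.0))
            except ValueError:
                converted.append(cell)  # non-numeric cell: keep as-is
        yield converted


# In-memory stand-in for the input file (sample values from the question).
sample = io.StringIO(
    "Date,Title1,Title2,Title3,Title4\n"
    "01/01/2016,32517293,45673,0.453,263749\n"
)
reader = csv.reader(sample)
header = next(reader)  # the header row is passed through unchanged
result = [header] + list(bytes_to_kb_rows(reader))

out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(result)
print(out.getvalue())
```

With real files in Python 3, open both files in text mode with newline='' instead of 'rb' and 'wb'.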

int() is the standard way in Python to convert a string to an int. It is used like this:
int("5") + 1
This returns 6. Hope this helps.

Depending on what else you may find yourself working on, I'd be tempted to use pandas for this one - given a file with the contents you describe, after importing the pandas module:
import pandas as pd
Read in the CSV file (the first line is automatically recognised as a header). The delimiter may not need specifying in your case if it's the default comma, but other delimiters are available; I'm a fan of the pipe '|' character.
csv = pd.read_csv("pandas_csv.csv",delimiter="|")
Then you can enrich/process your data as you like using the column names as references.
For example, to convert a column by some factor you might write:
csv['Title3'] = csv['Title3']/1024
The datatypes are again, automatically determined, so if a column is all numeric (as in the example) there's no need to do any conversion from datatype to datatype, 99% of the time, it figures it out correctly based on the data in the file.
Once you're happy with the edits, type
csv
To see a representation of the results, and then
csv.to_csv("pandas_csv.csv")
To save the results (in this case, overwriting the original file). You may instead want to write to a new file, with something more like:
csv.to_csv("pandas_csv_kilobytes.csv")
There are more useful/powerful functions available, but I know no easier method for manipulating tabular data than this - it's better and more reliable than Excel, and in years to come, you will celebrate the day you started using pandas!
In this case, you've opened, edited and saved the file using the following 4 lines of code:
import pandas as pd
csv = pd.read_csv("pandas_csv.csv",delimiter="|")
csv['Title3'] = csv['Title3']/1024
csv.to_csv("pandas_csv_kilobytes.csv")
That's about as powerful and convenient as it gets.

And another solution using a func (bytesto) from: gist.github.com/shawnbutts/3906915
import csv
def bytesto(bytes, to):
    a = {'k' : 1, 'm': 2, 'g' : 3, 't' : 4, 'p' : 5, 'e' : 6 }
    r = float(bytes)
    for i in range(a[to]):
        r = r / 1024
    return(int(r)) # the original gist does not truncate to int
with open('csvfile.csv', 'rb') as csvfile:
    data = csv.reader(csvfile, delimiter='|', quotechar='|')
    row=iter(data)
    next(row) # Jump title
    for row in data:
        print 'kb= ' + str(bytesto((row[1]), 'k')), 'kb= ' + str(bytesto((row[2]), 'k')), 'kb= ' + str(bytesto((row[3]), 'k')), 'kb= ' + str(bytesto((row[4]), 'k'))
Result:
kb= 31755 kb= 44 kb= 0 kb= 257
kb= 31955 kb= 64 kb= 0 kb= 159
kb= 32829 kb= 25 kb= 0 kb= 452
Hope this helps a bit.

if s is your string representing a byte value, you can convert to a string representing a kilobyte value with a single decimal place like this:
'%.1f' % (float(s)/1024)
Alternatively:
str(round(float(s)/1024, 1))
EDIT:
To prevent errors for non-digit strings, you can just make a conditional
'%.1f' % (float(s)/1024) if s.isdigit() else ''
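Note that str.isdigit() is False for strings such as '0.453' or '-5', so the conditional above would blank out decimal values like those in the question's Title3 column. A small sketch that relies on float() raising ValueError instead; the function name to_kb and its fallback parameter are assumptions for illustration, with the empty-string default carried over from the snippet above.

```python
def to_kb(s, fallback=''):
    """Return s (a byte count given as text) in kilobytes, one decimal place."""
    try:
        return '%.1f' % (float(s) / 1024)
    except ValueError:
        return fallback  # not a number, e.g. a date or a header


print(to_kb('32517293'))    # 31755.2
print(to_kb('0.453'))       # 0.0
print(to_kb('01/01/2016'))  # prints an empty line
```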

Related

Data Scraping from txt file with consistent structure

I'm working with a very old program that outputs the results for a batch query in a very odd format (at least for me).
Imagine having queried info for the objects A, B and C.
The output will look like this:
name : A
------
Group 1
p1 : 11
p2 : 12
Group 2
p3 : 23
p4 : 24
======
name : B
------
Group 1
p1 : 11
p2 : 12
Group 2
p3 : 23
p4 : 24
======
name : C
------
Group 1
p1 : 11
p2 : 12
Group 2
p3 : 23
p4 : 24
Do you have any idea of how to put the data in a more useful format?
A possible good format would be a table with columns A B C and rows p1, p2...
I had a few ideas, but I don't really know how to implement them:
Every object is separated by a ====== string, which means I can use it to split the output into many .txt files
Then I can read the files with Excel, setting : as the separator, obtaining a CSV file with 2 columns (one containing the p descriptors and one with the actual values)
Now I need to merge all the CSV files into one single CSV with as many columns as objects and px rows
I'd like to do this in Python but I really don't know any package for this situation. Also, there are a few hundred objects, so I need an automated algorithm for doing that.
Any tip, advice or idea you can think of is welcome.
Here's a quick solution putting the data you say you need - not all labels - in a csv file. Each output line starts with the name A/B/C and then comes the values p1..x.
It has no handling of missing values, so in that case just the present values will be listed, and column 5 will not always be p4. It relies on the assumption that there's a name line starting every item/entry, and that all other a:b lines have a value b to be stored. This should be a good start for putting the data into another structure should you need to. The format is truly special, more of a report structure, so I'd guess there's no suitable general-purpose library. Flat format is another similarly tricky old format type for which there are libraries. I've used it when calculating how much money each Swedish participant in the Interrail program should receive. Tricky business but fun! :-)
The code:
import re
import csv
with open('input.txt') as f:
    lines = f.readlines()
entries = []
entry = []
for line in lines:
    parts = re.split(r':', line)
    if len(parts) >= 2:
        label = parts[0]
        value = parts[1].strip()
        if label.startswith('name'):
            print('got name: ' + value)
            # start new entry with the name as first value
            entry = [value]
            entries.append(entry)
        else:
            print('got value: ' + value)
            entry.append(value)
print('collected {} entries'.format(len(entries)))
with open('output.csv', 'w', newline='') as output:
    wr = csv.writer(output, quoting=csv.QUOTE_ALL)
    wr.writerows(entries)
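The question asked for one column per object and one row per parameter. Assuming every object reports the same parameters in the same order (missing values are not handled, as noted above), the collected rows could be transposed with zip. The sample entries list and the p1..p4 labels below are illustrative assumptions, not output from the real program.

```python
import csv
import io

# Rows as collected by the parsing loop: name first, then the values.
entries = [
    ['A', '11', '12', '23', '24'],
    ['B', '11', '12', '23', '24'],
    ['C', '11', '12', '23', '24'],
]
labels = ['name', 'p1', 'p2', 'p3', 'p4']  # assumed parameter order

# zip(*entries) turns rows into columns; prepend the label to each one.
table = [[label] + list(col) for label, col in zip(labels, zip(*entries))]

out = io.StringIO()
csv.writer(out, lineterminator='\n').writerows(table)
print(out.getvalue())
```

zip stops at the shortest row, so an object with a missing value would silently truncate the table; that case needs explicit handling.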

TypeError: '_csv.reader' object is not subscriptable and days passed [duplicate]

I'm trying to parse through a csv file and extract the data from only specific columns.
Example csv:
ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
I'm trying to capture only specific columns, say ID, Name, Zip and Phone.
Code I've looked at has led me to believe I can call a specific column by its corresponding number, i.e. Name would correspond to 2, and iterating through each row using row[2] would produce all the items in column 2. Only it doesn't.
Here's what I've done so far:
import sys, argparse, csv
from settings import *
# command arguments
parser = argparse.ArgumentParser(description='csv to postgres',\
    fromfile_prefix_chars="#" )
parser.add_argument('file', help='csv file to import', action='store')
args = parser.parse_args()
csv_file = args.file
# open csv file
with open(csv_file, 'rb') as csvfile:
    # get number of columns
    for line in csvfile.readlines():
        array = line.split(',')
        first_item = array[0]
    num_columns = len(array)
    csvfile.seek(0)
    reader = csv.reader(csvfile, delimiter=' ')
    included_cols = [1, 2, 6, 7]
    for row in reader:
        content = list(row[i] for i in included_cols)
        print content
and I'm expecting that this will print out only the specific columns I want for each row except it doesn't, I get the last column only.
The only way you would be getting the last column from this code is if you don't include your print statement in your for loop.
This is most likely the end of your code:
for row in reader:
    content = list(row[i] for i in included_cols)
print content
You want it to be this:
for row in reader:
    content = list(row[i] for i in included_cols)
    print content
Now that we have covered your mistake, I would like to take this time to introduce you to the pandas module.
Pandas is spectacular for dealing with csv files, and the following code would be all you need to read a csv and save an entire column into a variable:
import pandas as pd
df = pd.read_csv(csv_file)
saved_column = df.column_name #you can also use df['column_name']
so if you wanted to save all of the info in your column Names into a variable, this is all you need to do:
names = df.Names
It's a great module and I suggest you look into it. If your print statement was already inside the for loop and it was still only printing out the last column, which shouldn't happen, let me know and I'll revisit my assumption. Your posted code has a lot of indentation errors, so it was hard to know what was supposed to be where. Hope this was helpful!
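For reference, a minimal Python 3 sketch of selecting columns by index with the csv module. It assumes a comma-delimited file and uses an in-memory copy of the question's (truncated) sample row. Note that indices are zero-based, so ID, Name, Zip and Phone are columns 0, 1, 5 and 6 rather than 1, 2, 6 and 7.

```python
import csv
import io

# In-memory stand-in for the CSV file; a comma delimiter is assumed.
data = io.StringIO(
    "ID,Name,Address,City,State,Zip,Phone,OPEID,IPEDS\n"
    "10,C...,130 W..,Mo..,AL...,3..,334..,01023,10063\n"
)
included_cols = [0, 1, 5, 6]  # zero-based: ID, Name, Zip, Phone

selected = []
for row in csv.reader(data):
    # Build the subset inside the loop so every row is kept.
    selected.append([row[i] for i in included_cols])

print(selected)
```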
import csv
from collections import defaultdict
columns = defaultdict(list) # each value in each column is appended to a list
with open('file.txt') as f:
    reader = csv.DictReader(f) # read rows into a dictionary format
    for row in reader: # read a row as {column1: value1, column2: value2,...}
        for (k,v) in row.items(): # go over each column name and value
            columns[k].append(v) # append the value into the appropriate list
                                 # based on column name k
print(columns['name'])
print(columns['phone'])
print(columns['street'])
With a file like
name,phone,street
Bob,0893,32 Silly
James,000,400 McHilly
Smithers,4442,23 Looped St.
Will output
>>>
['Bob', 'James', 'Smithers']
['0893', '000', '4442']
['32 Silly', '400 McHilly', '23 Looped St.']
Or alternatively if you want numerical indexing for the columns:
with open('file.txt') as f:
    reader = csv.reader(f)
    next(reader)
    for row in reader:
        for (i,v) in enumerate(row):
            columns[i].append(v)
print(columns[0])
>>>
['Bob', 'James', 'Smithers']
To change the delimiter, add delimiter=" " to the appropriate instantiation, i.e. reader = csv.reader(f, delimiter=" ")
Use pandas:
import pandas as pd
my_csv = pd.read_csv(filename)
column = my_csv.column_name
# you can also use my_csv['column_name']
Discard unneeded columns at parse time:
my_filtered_csv = pd.read_csv(filename, usecols=['col1', 'col3', 'col7'])
P.S. I'm just aggregating what others have said in a simple manner. Actual answers are taken from here and here.
You can use numpy.loadtxt(filename). For example, if this is your database .csv:
ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | Adam | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Carl | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Adolf | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Den | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
And you want the Name column:
import numpy as np
b=np.loadtxt(r'filepath\name.csv',dtype=str,delimiter='|',skiprows=1,usecols=(1,))
>>> b
array([' Adam ', ' Carl ', ' Adolf ', ' Den '],
dtype='|S7')
More easily, you can use genfromtxt:
b = np.genfromtxt(r'filepath\name.csv', delimiter='|', names=True,dtype=None)
>>> b['Name']
array([' Adam ', ' Carl ', ' Adolf ', ' Den '],
dtype='|S7')
With pandas you can use read_csv with usecols parameter:
df = pd.read_csv(filename, usecols=['col1', 'col3', 'col7'])
Example:
import pandas as pd
import io
s = '''
total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.5,Male,No,Sun,Dinner,3
'''
df = pd.read_csv(io.StringIO(s), usecols=['total_bill', 'day', 'size'])
print(df)
   total_bill  day  size
0       16.99  Sun     2
1       10.34  Sun     3
2       21.01  Sun     3
Context: For this type of work you should use the amazing python petl library. That will save you a lot of work and potential frustration from doing things 'manually' with the standard csv module. AFAIK, the only people who still use the csv module are those who have not yet discovered better tools for working with tabular data (pandas, petl, etc.), which is fine, but if you plan to work with a lot of data in your career from various strange sources, learning something like petl is one of the best investments you can make. To get started should only take 30 minutes after you've done pip install petl. The documentation is excellent.
Answer: Let's say you have the first table in a csv file (you can also load directly from the database using petl). Then you would simply load it and do the following.
from petl import fromcsv, look, cut, tocsv
#Load the table
table1 = fromcsv('table1.csv')
# Alter the columns
table2 = cut(table1, 'Song_Name','Artist_ID')
#have a quick look to make sure things are ok. Prints a nicely formatted table to your console
print look(table2)
# Save to new file
tocsv(table2, 'new.csv')
I think there is an easier way
import pandas as pd
dataset = pd.read_csv('table1.csv')
ftCol = dataset.iloc[:, 0].values
Here, in iloc[:, 0], the : means all rows and 0 is the position of the column.
In the example below, ID will be selected:
ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
import pandas as pd
csv_file = pd.read_csv("file.csv")
column_val_list = csv_file.column_name.values  # public accessor; _ndarray_values is a private attribute
Thanks to the way you can index and subset a pandas dataframe, a very easy way to extract a single column from a csv file into a variable is:
myVar = pd.read_csv('YourPath', sep = ",")['ColumnName']
A few things to consider:
The snippet above will produce a pandas Series and not dataframe.
The suggestion from ayhan with usecols will also be faster if speed is an issue.
Testing the two different approaches using %timeit on a 2122 KB sized csv file yields 22.8 ms for the usecols approach and 53 ms for my suggested approach.
And don't forget import pandas as pd
If you need to process the columns separately, I like to destructure the columns with the zip(*iterable) pattern (effectively "unzip"). So for your example:
ids, names, zips, phones = zip(*(
    (row[1], row[2], row[6], row[7])
    for row in reader
))
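A self-contained sketch of the same "unzip" pattern in Python 3, using two sample rows borrowed from another answer in this thread and an in-memory file in place of a real one:

```python
import csv
import io

data = io.StringIO(
    "Bob,0893,32 Silly\n"
    "James,000,400 McHilly\n"
)
reader = csv.reader(data)

# Each generator item is a (name, phone) pair; zip(*...) regroups by column.
names, phones = zip(*((row[0], row[1]) for row in reader))

print(names)   # ('Bob', 'James')
print(phones)  # ('0893', '000')
```

The results are tuples, and unpacking raises ValueError if the file has no rows, so an empty input needs a separate check.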
import pandas as pd
dataset = pd.read_csv('Train.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
X is a bunch of columns; use it if you want to read more than one column.
y is a single column; use it to read one column.
[:, 1:-1] means [row_index : to_row_index, column_index : to_column_index]
SAMPLE.CSV
a, 1, +
b, 2, -
c, 3, *
d, 4, /
column_names = ["Letter", "Number", "Symbol"]
df = pd.read_csv("sample.csv", names=column_names)
print(df)
OUTPUT
Letter Number Symbol
0 a 1 +
1 b 2 -
2 c 3 *
3 d 4 /
letters = df.Letter.to_list()
print(letters)
OUTPUT
['a', 'b', 'c', 'd']
import csv
with open('input.csv', encoding='utf-8-sig') as csv_file:
    # the below statement will skip the first row
    next(csv_file)
    reader= csv.DictReader(csv_file)
    Time_col ={'Time' : []}
    #print(Time_col)
    for record in reader :
        Time_col['Time'].append(record['Time'])
    print(Time_col)
From CSV File Reading and Writing you can import csv and use this code:
with open('names.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row['first_name'], row['last_name'])
To fetch the column names, use readline() instead of readlines(); this avoids looping over and storing the complete file just to read its first line.
with open(csv_file, 'rb') as csvfile:
    # get number of columns
    line = csvfile.readline()
    first_item = line.split(',')
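Splitting the line on ',' leaves the trailing newline on the last item and breaks on quoted fields that contain commas. A sketch that lets csv.reader parse just the header row instead (Python 3; the in-memory sample data is an assumption for illustration):

```python
import csv
import io

data = io.StringIO("name,phone,street\nBob,0893,32 Silly\n")
reader = csv.reader(data)
header = next(reader)       # consumes only the first row
num_columns = len(header)
print(header, num_columns)  # ['name', 'phone', 'street'] 3
```

The same reader can then be iterated to process the remaining rows without seeking back to the start of the file.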

How to import a file with data and pass it to a list in python?

I have a .dat data file containing 3 columns of values for certain quantities, in this form:
apr.dat
| Mass density | Pressure | Energy density |
|:---- |:------:| -----:|
|2.700000e-02 |1.549166e-11|2.700000e-02 |
|2.807784e-02 |1.650004e-11|2.807784e-02 |
|2.919872e-02 |1.757406e-11|2.919872e-02 |
|3.036433e-02 |1.871798e-11|3.036433e-02 |
|3.157648e-02 |1.993637e-11|3.157648e-02 |
|3.283702e-02 |2.123406e-11|3.283702e-02 |
|3.414788e-02 |2.261622e-11|3.414788e-02 |
...
I just want to use the second and third column of data (without using the title). I was able to open the file using
data = open(r"C:\Users\Ramos\PycharmProjects\pythonProject\\apr.dat")
print(data.read())
And then, I tried to turn it into a list with the following code:
import numpy as np
data = open(r"C:\Users\Ramos\PycharmProjects\pythonProject\\apr.dat")
data2 = np.shape(data)
print(data2[1])
But when I tried to insert the numbers of column 2 and column 3 in a list, it gave an error. Is there an easier way to do this?
Thanks for any help.
I think there is no need at this point to use numpy.
import csv
with open(r"C:\Users\Ramos\PycharmProjects\pythonProject\apr.dat", 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    next(reader) # skip 1st line
    array = [[float(row[1]), float(row[2])] for row in reader] # skip column 0 (1st col)
EDIT: if you want separate lists for x and y:
x, y = list(zip(*array))
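If the columns in apr.dat turn out to be separated by runs of spaces rather than tabs (the question does not say which), str.split() with no argument handles either. A minimal sketch under that assumption, using an in-memory copy of the first rows instead of the real file:

```python
import io

# Stand-in for the opened apr.dat; whitespace-separated columns are assumed.
f = io.StringIO(
    "Mass density  Pressure  Energy density\n"
    "2.700000e-02 1.549166e-11 2.700000e-02\n"
    "2.807784e-02 1.650004e-11 2.807784e-02\n"
)
next(f)  # skip the title line
pressure, energy = [], []
for line in f:
    fields = line.split()  # splits on any run of whitespace
    if len(fields) < 3:
        continue           # ignore blank or short lines
    pressure.append(float(fields[1]))
    energy.append(float(fields[2]))

print(pressure)
print(energy)
```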

python csv | If row = x, print column containing x

I am new to python programming, pardon me if I make any mistakes. I am writing a python script to read a csv file and print out the required cell of the column if it contains the information in the row.
| A | B | C
---|----|---|---
1 | Re | Mg| 23
---|----|---|---
2 | Ra | Fe| 90
For example, I check column C for a value between 20 and 24. If the condition passes, it should return cell A1 (Re) as the result.
At the moment, I only have the following and I have no idea how to proceed from here.
f = open( 'imageResults.csv', 'rU' )
for line in f:
    cells = line.split( "," )
    if(cells[2] >= 20 and cells[2] <= 24):
f.close()
This might contain the answer to my question but I can't seem to make it work.
UPDATE
If there is a header in the first row, how do I get it to work? I tried changing the condition to a string comparison, but that doesn't work if I want to search for a range of values.
| A | B | C
---|----|---|---
1 |Name|Lat|Ref
---|----|---|---
2 | Re | Mg| 23
---|----|---|---
3 | Ra | Fe| 90
You should use a csv reader. It's built into Python, so there are no dependencies to install. Then you need to tell Python that the third column is an integer. Something like this will do it:
import csv
with open('data.csv', 'rb') as f:
    for line in csv.reader(f):
        if 20 <= int(line[2]) <= 24:
            print(line)
With this data in data.csv:
Re,Mg,23
Ra,Fe,90
Ha,Ns,50
Ku,Rt,20
the output will be:
$ python script.py
['Re', 'Mg', '23']
['Ku', 'Rt', '20']
Update:
If in the [first] row, there is a header, how do i get it to work?
There's csv.DictReader which is for that. Indeed it is safer to work with DictReader, especially when the order of the columns might change or you insert a column before the third column. Given this data in data.csv
Name,Lat,Ref
Re,Mg,23
Ra,Fe,90
Ha,Ns,50
Ku,Rt,20
Then this is the Python script:
import csv
with open('data.csv', 'rb') as f:
    for line in csv.DictReader(f):
        if 20 <= int(line['Ref']) <= 24:
            print(line)
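The 'rb' mode above is what the Python 2 csv module expects; in Python 3 the file would be opened in text mode with newline=''. The question also asked for just the cell from the first column, so a Python 3 sketch might collect the Name values of matching rows. The in-memory file below stands in for data.csv.

```python
import csv
import io

# Stand-in for open('data.csv', newline='') in Python 3.
data = io.StringIO("Name,Lat,Ref\nRe,Mg,23\nRa,Fe,90\nHa,Ns,50\nKu,Rt,20\n")

matches = [
    row['Name']
    for row in csv.DictReader(data)
    if 20 <= int(row['Ref']) <= 24  # convert the text to an int before comparing
]
print(matches)  # ['Re', 'Ku']
```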
P.S. Welcome to Python. It's a good language for learning to program.

Python - Print list of CSV strings in aligned columns

I have written a fragment of code that is fully compatible with both Python 2 and Python 3. The fragment that I wrote parses data and it builds the output as a list of CSV strings.
The script provides an option to:
write the data to a CSV file, or
display it to the stdout.
While I could easily iterate through the list and replace , with \t when displaying to stdout (second bullet option), the items are of arbitrary length, so don't line up in a nice format due to variances in tabs.
I have done quite a bit of research, and I believe that string format options could accomplish what I'm after. That said, I can't seem to find an example that helps me get the syntax correct.
I would prefer to not use an external library. I am aware that there are many options available if I went that route, but I want the script to be as compatible and simple as possible.
Here is an example:
value1,somevalue2,value3,reallylongvalue4,value5,superlongvalue6
value1,value2,reallylongvalue3,value4,value5,somevalue6
Can you help me please? Any suggestion will be much appreciated.
import csv
from StringIO import StringIO
rows = list(csv.reader(StringIO(
    '''value1,somevalue2,value3,reallylongvalue4,value5,superlongvalue6
value1,value2,reallylongvalue3,value4,value5,somevalue6''')))
widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
for row in rows:
    print(' | '.join(cell.ljust(width) for cell, width in zip(row, widths)))
Output:
value1 | somevalue2 | value3           | reallylongvalue4 | value5 | superlongvalue6
value1 | value2     | reallylongvalue3 | value4           | value5 | somevalue6
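The StringIO module imported above exists only in Python 2, while the question asks for code that runs on both versions. Since csv.reader accepts any iterable of strings, the list of CSV strings can be passed to it directly. A sketch of that variant follows; the function name format_table and the choice to strip trailing padding are assumptions for illustration.

```python
import csv


def format_table(csv_strings, sep=' | '):
    """Return the CSV strings as lines with left-aligned, padded columns."""
    rows = list(csv.reader(csv_strings))  # any iterable of strings works
    # Width of each column = length of its longest cell.
    widths = [max(len(cell) for cell in column) for column in zip(*rows)]
    return [
        sep.join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    ]


lines = format_table([
    'value1,somevalue2,value3,reallylongvalue4,value5,superlongvalue6',
    'value1,value2,reallylongvalue3,value4,value5,somevalue6',
])
for line in lines:
    print(line)
```

This assumes every row has the same number of columns; zip(*rows) would otherwise drop the extra cells of longer rows.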
def printCsvStringListAsTable(csvStrings):
    # convert to list of lists
    csvStrings = map(lambda x: x.split(','), csvStrings)
    # get max column widths for printing
    widths = []
    for idx in range(len(csvStrings[0])):
        columns = map(lambda x: x[idx], csvStrings)
        widths.append(
            len(
                max(columns, key = len)
            )
        )
    # print the csv strings
    for row in csvStrings:
        cells = []
        for idx, col in enumerate(row):
            format = '%-' + str(widths[idx]) + "s"
            cells.append(format % (col))
        print ' |'.join(cells)
if __name__ == '__main__':
    printCsvStringListAsTable([
        'col1,col2,col3,col4',
        'val1,val2,val3,val4',
        'abadfafdm,afdafag,aadfag,aadfaf',
    ])
Output:
col1      |col2    |col3   |col4
val1      |val2    |val3   |val4
abadfafdm |afdafag |aadfag |aadfaf
The answer by Alex Hall is definitely better: it is a terser form of the same code I have written.
