I am trying to read in a csv file with numpy.genfromtxt but some of the fields are strings which contain commas. The strings are in quotes, but numpy is not recognizing the quotes as defining a single string. For example, with the data in 't.csv':
2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0
the code
np.genfromtxt('t.csv', delimiter=',')
produces the error:
ValueError: Some errors were detected !
Line #2 (got 4 columns instead of 3)
The data structure I am looking for is:
array([['2012', 'Louisville KY', '3.5'],
       ['2011', 'Lexington, KY', '4.0']],
      dtype='|S13')
Looking over the documentation, I don't see any options to deal with this. Is there a way to do it with numpy, or do I just need to read in the data with the csv module and then convert it to a numpy array?
You can use pandas (which is becoming the default library for working with dataframes, i.e. heterogeneous data, in scientific Python) for this. Its read_csv can handle this. From the docs:
quotechar : string
The character used to denote the start and end of a quoted item. Quoted items
can include the delimiter and it will be ignored.
The default value is ". An example:
In [1]: import pandas as pd
In [2]: from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
In [3]: s="""year, city, value
...: 2012, "Louisville KY", 3.5
...: 2011, "Lexington, KY", 4.0"""
In [4]: pd.read_csv(StringIO(s), quotechar='"', skipinitialspace=True)
Out[4]:
   year           city  value
0  2012  Louisville KY    3.5
1  2011  Lexington, KY    4.0
The trick here is that you also have to use skipinitialspace=True to deal with the spaces after the comma delimiter.
Apart from being a powerful CSV reader, I can also strongly advise using pandas for the heterogeneous data you have (the example output you give in numpy is all strings, although you could use structured arrays).
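If you do want exactly the all-strings ndarray from the question, a minimal sketch on top of read_csv (assuming t.csv has no header row, and a pandas recent enough to have .to_numpy()):
import pandas as pd

arr = pd.read_csv('t.csv', header=None, quotechar='"',
                  skipinitialspace=True, dtype=str).to_numpy()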
np.genfromtxt itself does not deal with the additional comma inside a quoted field.
One simple solution is to read the file with csv.reader() from Python's csv module into a list and then dump it into a numpy array if you like.
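A minimal sketch of that approach, using the t.csv from the question:
import csv
import numpy as np

with open('t.csv', newline='') as f:
    rows = list(csv.reader(f, skipinitialspace=True))
data = np.array(rows)  # dtype is inferred as a fixed-width string type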
If you really want to use np.genfromtxt, note that it can take iterators instead of files, e.g. np.genfromtxt(my_iterator, ...). So, you can wrap a csv.reader in an iterator and give it to np.genfromtxt.
That would go something like this:
import csv
import numpy as np

with open('myfile.csv', newline='') as f:
    data = np.genfromtxt(("\t".join(row) for row in csv.reader(f)),
                         delimiter="\t", dtype=None, encoding=None)
This essentially replaces, on the fly, only the appropriate commas with tabs (dtype=None additionally lets genfromtxt infer each column's type instead of forcing everything to float).
If you are using numpy, you probably want to work with a numpy.ndarray. This will give you one:
import pandas
data = pandas.read_csv('file.csv').to_numpy()  # .as_matrix() was deprecated and later removed
Pandas will handle the "Lexington, KY" case correctly.
Make a better function that combines the power of the standard csv module and Numpy's recfromcsv. For instance, the csv module has good control and customization of dialects, quotes, escape characters, etc., which you can add to the example below.
The example recfromcsv_mod function below reads in a complicated CSV file similar to what Microsoft Excel produces, which may contain commas within quoted fields. Internally, the function has a generator function that rewrites each row with tab delimiters.
import csv
import numpy as np

def recfromcsv_mod(fname, **kwargs):
    def rewrite_csv_as_tab(fname):
        with open(fname, newline='') as fp:
            dialect = csv.Sniffer().sniff(fp.read(1024))
            fp.seek(0)
            for row in csv.reader(fp, dialect):
                yield "\t".join(row)
    return np.recfromcsv(
        rewrite_csv_as_tab(fname), delimiter="\t", encoding=None, **kwargs)
# Use it to read a CSV file into a record array
x = recfromcsv_mod("t.csv", case_sensitive=True)
You can try this code. We are reading the .csv file with the np.genfromtxt() method.
Code:
import numpy as np

myfile = np.genfromtxt('MyData.csv', delimiter=',')
myfile = myfile.astype('int64')
print(myfile)
Output:
[[ 1 1 1 1 1 1 1 1 1 1 1]
[ 3 3 3 3 3 3 3 3 3 3 3]
[ 3 3 3 3 3 3 3 3 3 3 3]
[ 4 4 4 4 4 4 4 4 4 4 4]
[ 5 5 5 5 5 5 5 5 5 5 5]
[ 6 6 6 6 6 6 6 6 6 6 6]
[ 7 7 7 7 7 7 7 7 7 7 7]
[ 8 8 8 8 8 8 8 8 8 8 8]
[ 9 9 9 9 9 9 9 9 9 9 9]
[10 10 10 10 10 10 10 10 10 10 10]
[11 11 11 11 11 11 11 11 11 11 11]
[12 12 12 12 12 12 12 12 12 12 12]
[13 13 13 13 13 13 13 13 13 13 13]
[14 14 14 14 14 14 14 14 14 14 14]
[15 15 15 15 15 15 15 15 15 15 15]
[16 17 18 19 20 21 22 23 24 25 26]]
Input File "MyData.csv"
Related
I want to use read_parquet but read backwards from where I start (assuming a sorted index). I don't want to read the entire parquet file into memory because that defeats the whole point of using it. Is there a nice way to do this?
Assuming that the dataframe is indexed, the inversion of the index can be done as a two-step process: invert the order of the partitions and invert the index within each partition:
from dask.datasets import timeseries
ddf = timeseries()
ddf_inverted = (
    ddf
    .partitions[::-1]
    .map_partitions(lambda df: df.sort_index(ascending=False))
)
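To sanity-check the inversion, you could compare both ends (a quick sketch; timeseries() above generates a small random demo dataframe):
print(ddf.head(3))            # first rows of the original
print(ddf_inverted.tail(3))   # the same rows, in reverse order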
If the last N rows are all in the last partition, you can use dask.dataframe.tail. If not, you can iterate backwards using the dask.dataframe.partitions attribute. This isn't particularly smart and will blow up your memory if you request too many rows, but it should do the trick:
import pandas as pd

def get_last_n(n, df):
    read = []
    lines_read = 0
    # Walk the partitions from last to first, taking rows from each tail
    # until n rows have been collected.
    for i in range(df.npartitions - 1, -1, -1):
        p = df.partitions[i].tail(n - lines_read)
        read.insert(0, p)
        lines_read += len(p)
        if lines_read >= n:
            break
    return pd.concat(read, axis=0)
For example, here's a dataframe with 20 rows and 5 partitions:
import dask.dataframe, pandas as pd, numpy as np, dask
df = dask.dataframe.from_pandas(pd.DataFrame({'A': np.arange(20)}), npartitions=5)
You can call the above function with any number of rows to get that many rows in the tail:
In [4]: get_last_n(4, df)
Out[4]:
A
16 16
17 17
18 18
19 19
In [5]: get_last_n(10, df)
Out[5]:
A
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
Requesting more rows than are in the dataframe just computes the whole dataframe:
In [6]: get_last_n(1000, df)
Out[6]:
A
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
Note that this requests the data iteratively, so may be very inefficient if your graph is complex and involves lots of shuffles.
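If the graph is that expensive, one possible mitigation (a sketch, assuming the data fits in cluster or local memory) is to persist the collection first, so the repeated partition accesses reuse the materialized result:
df = df.persist()  # compute once; later df.partitions[i].tail(...) calls reuse it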
I'm supposed to create code that will simulate a 20-sided die (d20) being rolled 25 times, using np.random.choice.
I tried this:
np.random.choice(20,25)
but this still includes 0s, which wouldn't appear on a die.
How do I account for the 0s?
Use np.arange:
import numpy as np
np.random.seed(42) # for reproducibility
result = np.random.choice(np.arange(1, 21), 50)
print(result)
Output
[ 7 20 15 11 8 7 19 11 11 4 8 3 2 12 6 2 1 12 12 17 10 16 15 15
19 12 20 3 5 19 7 9 7 18 4 14 18 9 2 20 15 7 12 8 15 3 14 17
4 18]
The above code draws numbers from 1 to 20, both inclusive. To understand why your original call returned values from 0 to 19, you could check the documentation of np.random.choice, in particular on the first argument:
a : 1-D array-like or int
If an ndarray, a random sample is generated from its elements. If an
int, the random sample is generated as if it were np.arange(a)
np.random.choice() takes as its first argument an array of possible choices (if an int n is given, it works like np.arange(n)), so you can also pass list(range(1, 21)) to get the output you want.
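For instance, a minimal sketch of that variant:
import numpy as np

rolls = np.random.choice(list(range(1, 21)), 25)  # 25 rolls of a d20, values 1-20
print(rolls)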
Alternatively, simply add 1 to shift the sampled values from 0-19 up to 1-20:
np.random.choice(20,25) + 1
I have a file full of numbers in the form:
010101228522 0 31010 3 3 7 7 43 0 2 4 4 2 2 3 3 20.00 89165.30
01010222852313 3 0 0 7 31027 63 5 2 0 0 3 2 4 12 40.10 94170.20
0101032285242337232323 7 710153 9 22 9 9 9 3 3 4 80.52 88164.20
0101042285252313302330302323197 9 5 15 9 15 15 9 9 110.63 98168.80
01010522852617 7 7 3 7 31330 87 6 3 3 2 3 2 5 15 50.21110170.50
...
...
I am trying to read this file but I am not sure how to go about it. When I use the built-in open, numpy's loadtxt, or even pandas, the file is read as one column; that is, its shape is (364, 1). I want the numbers separated into columns and the blank spaces replaced by zeros. Any help would be appreciated. Note that in some places two spaces follow each other.
If the column content is a string, have you tried str.split()? It will turn the string into a list, with each number separated at the gaps. You could then loop over that list to build a table out of it. Not quite sure this has answered the question; sorry if not.
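A minimal illustration on one line from the file (note that split() alone will not separate the fused columns, such as the run-together numbers at the end of some rows):
line = "010101228522 0 31010 3 3 7 7 43 0 2 4 4 2 2 3 3 20.00 89165.30"
fields = line.split()  # splits on any run of whitespace, so double spaces are handled
print(fields)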
So I finally solved my problem. I actually had to strip the lines and then read each "letter" from the line; in my case I am picking the individual numbers from the stripped line and appending them to an array. Here is the code for my solution:
import pandas as pd

arr = []
with open('Kp2001', 'r') as f:
    for ii, line in enumerate(f):
        arr.append([])              # creates a new row for this line
        cnt = line.strip()          # strip the line
        for letter in cnt:          # get each 'letter' from the line, in my case the individual numbers
            arr[ii].append(letter)  # append them individually so Python does not read them as one string
df = pd.DataFrame(arr)   # converting to DataFrame gives proper columns and keeps the spaces in their respective columns
df2 = df.replace(' ', 0) # replace the spaces with what you will
I have distributed information over multiple large csv files.
I want to combine all the files into one new file such that the first row from the first file is joined with the first row from each other file, and so on.
file1.csv
A,B
A,C
A,D
file2.csv
F,G
H,I
J,K
expected result:
output.csv
A,B,F,G
A,C,H,I
A,D,J,K
So, consider I have an array ['file1.csv', 'file2.csv', ...]. How do I go from here?
I tried to load each file into memory and combine them with np.column_stack, but my files are too large to fit in memory.
Not pretty code, but this should work.
I'm not using with open('filename', 'r') as myfile for the inputs. It could get a bit messy with 50 files, so these are opened and closed explicitly.
It opens each file and places the handle in a list. The first handle is taken as the master file; then we iterate through it line by line, each time reading one line from every other open file, joining them with ',', and writing that to the output file.
Note that if the other files have more lines than the master, the extra lines won't be included. If any have fewer lines, readline() simply returns an empty string once a file is exhausted, so you will get empty trailing fields rather than an error. I'll leave it to you to deal with these situations gracefully.
Note also that you can use glob to create filelist if the names follow a logical pattern (thanks to N. Wouda, below)
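For example, a quick sketch, assuming the file names share a common prefix:
import glob
filelist = sorted(glob.glob('book*.csv'))  # e.g. book1.csv, book2.csv, ...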
filelist = ['book1.csv', 'book2.csv', 'book3.csv', 'book4.csv']
openfiles = []
for filename in filelist:
    openfiles.append(open(filename, 'r'))  # text mode; 'rb' would yield bytes on Python 3

# Use the first file in the list as the master.
# All files must have the same number of lines (or more).
masterfile = openfiles.pop(0)

with open('output.csv', 'w') as outputfile:
    for line in masterfile:
        outputlist = [line.strip()]
        for openfile in openfiles:
            outputlist.append(openfile.readline().strip())
        outputfile.write(','.join(outputlist) + '\n')

masterfile.close()
for openfile in openfiles:
    openfile.close()
Input Files
a b c d e f
1 2 3 4 5 6
7 8 9 10 11 12
13 14 15 16 17 18
Output
a b c d e f a b c d e f a b c d e f a b c d e f
1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6
7 8 9 10 11 12 7 8 9 10 11 12 7 8 9 10 11 12 7 8 9 10 11 12
13 14 15 16 17 18 13 14 15 16 17 18 13 14 15 16 17 18 13 14 15 16 17 18
Instead of completely reading the files into memory, you can iterate over them line by line.
from itertools import izip  # Python 2's zip; gives us an iterator instead of a list

with open('file1.csv') as f1, open('file2.csv') as f2, open('output.csv', 'w') as out:
    for f1line, f2line in izip(f1, f2):
        out.write('{},{}'.format(f1line.strip(), f2line))
Demo:
$ cat file1.csv
A,B
A,C
A,D
$ cat file2.csv
F,G
H,I
J,K
$ python2.7 merge.py
$ cat output.csv
A,B,F,G
A,C,H,I
A,D,J,K
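On Python 3 the same approach needs no itertools import, since the built-in zip is already lazy; a minimal equivalent sketch:
with open('file1.csv') as f1, open('file2.csv') as f2, open('output.csv', 'w') as out:
    for f1line, f2line in zip(f1, f2):
        out.write('{},{}'.format(f1line.strip(), f2line))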
I need some very basic help with Python 3.3. I'm trying to get a better understanding of formatting using a for loop and I want to simply print out the odd numbers from 1-20 in two columns.
Here is what I've tried:
for col1 in range(1,10,2):
    for col2 in range(11,20,2):
        print(col1,'\t',col2)
For some reason my output is very strange. The left column has the odd numbers from 1-10, but each number is listed five times before it goes on to the next number:
1 11
1 13
1 15
1 17
1 19
3 11
3 13
3 15
3 17
3 19
etc..
What I want is:
1 11
3 13
5 15
7 17
9 19
You should do it using zip:
for i,j in zip(range(1,10,2), range(11,20,2)):
    print('{}\t{}'.format(i,j))
[OUTPUT]
1 11
3 13
5 15
7 17
9 19
When you use nested loops, the problem is that you are printing the second column for each number in the first column, which is not what you want. Instead, you want to iterate through them simultaneously. That is where zip comes in handy.
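Note that zip stops at the shorter iterable; if the two ranges could differ in length, itertools.zip_longest pads the shorter one instead (a sketch with a hypothetical longer second range):
from itertools import zip_longest

for i, j in zip_longest(range(1, 10, 2), range(11, 24, 2), fillvalue=''):
    print('{}\t{}'.format(i, j))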
You do not need a second for-loop or zip here. Instead, all you need is this:
>>> for n in range(1, 10, 2):
...     print(n, '\t', n + 10)
...
1 11
3 13
5 15
7 17
9 19
>>>
It works because the numbers in the second column are simply those in the first plus 10.