I would like to import a .dat file that contains a text line, a header, rows of numbers, and a closing text line,
something like this example:
start using data to calculate something
x y z g h
1 4 6 8 3
4 5 6 8 9
2 3 6 8 5
end the data that I should import.
I am trying to read this file, drop the first and last lines, put the numbers in an array, and do some basic calculations on them, but I could not get rid of the text lines. I used data = np.genfromtxt('sample.dat') to import the data, but with the text lines present I cannot do anything. Can anyone help me?
Maybe this helps you:
import numpy as np
data = np.genfromtxt('sample.dat',
                     skip_header=1,
                     skip_footer=1,
                     names=True,
                     dtype=None,
                     delimiter=' ')
print(data)
# Output: [(1, 4, 6, 8, 3) (4, 5, 6, 8, 9) (2, 3, 6, 8, 5)]
Please refer to the numpy documentation for further information about the parameters used: https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html
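If a plain 2D array is more convenient for arithmetic than a structured array with named fields, one option is to skip the column-name line as well. This is only a minimal sketch: it assumes exactly one text line before the header and one after the numbers, and io.StringIO stands in for 'sample.dat'.

```python
import io
import numpy as np

# Stand-in for 'sample.dat' (assumed layout: text line, header, numbers, text line).
sample = io.StringIO(
    "start using data to calculate something\n"
    "x y z g h\n"
    "1 4 6 8 3\n"
    "4 5 6 8 9\n"
    "2 3 6 8 5\n"
    "end the data that I should import.\n"
)

# Skip the text line and the column-name line at the top, and the last line.
data = np.genfromtxt(sample, skip_header=2, skip_footer=1)

print(data.shape)               # (3, 5)
print(data.mean(axis=0))        # mean of each column
print(data[:, 0] + data[:, 1])  # element-wise sum of the x and y columns
```

The result is a float array, so the usual NumPy arithmetic and reductions apply directly.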
I have a tab-delimited file and I want to read all the column headers, but the last two columns share a single column header.
Example 1st row of file:
xx yy zz ii jj
5 5 10 2 a d
In my example, the column header is jj and its values are a and d, which span two tab-separated fields. I tried genfromtxt, but it gives:
ValueError: Some errors were detected !
Line #2 (got 6 columns instead of 5).
I would like to keep using numpy's genfromtxt because of my existing code, but
any method will do right now. genfromtxt seems difficult to use here.
I expect a tuple per row. At one point I got
[(5, 5, 10, 2, b'a') for the first row, but I would like to get [(5, 5, 10, 2, ['a','d']) if possible.
Thank you
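One possible approach, if genfromtxt is not a hard requirement, is to split the lines by hand and gather every field beyond the named columns into a list. This is only a sketch: it assumes every column except the last holds a single integer, and the sample string below is a hypothetical stand-in for the real file.

```python
import io

# Hypothetical stand-in for the tab-delimited file.
sample = io.StringIO("xx\tyy\tzz\tii\tjj\n5\t5\t10\t2\ta\td\n")

header = sample.readline().rstrip("\n").split("\t")
n_fixed = len(header) - 1  # every column except the last holds one value

rows = []
for line in sample:
    fields = line.rstrip("\n").split("\t")
    fixed = [int(v) for v in fields[:n_fixed]]       # single-valued columns
    rows.append(tuple(fixed) + (fields[n_fixed:],))  # the rest goes under the last header

print(header)  # ['xx', 'yy', 'zz', 'ii', 'jj']
print(rows)    # [(5, 5, 10, 2, ['a', 'd'])]
```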
I have a 2D array:
import numpy as np
output = np.array([1,1,6])*np.arange(6)[:,None]+1
output
Out[32]:
array([[ 1,  1,  1],
       [ 2,  2,  7],
       [ 3,  3, 13],
       [ 4,  4, 19],
       [ 5,  5, 25],
       [ 6,  6, 31]])
I tried np.savetxt('file1.txt', output, fmt='%10d'),
but I got the result on one line only.
How can I save it in a txt file similar to:
x y z
1 1 1
2 2 7
3 3 13
4 4 19
5 5 25
6 6 31
That is, 3 separate columns, each with a name (x, y, z).
Please note: the original array is very large (40000000 rows and 3 columns). I am using Python 3.6.
I have tried the solutions here and here, but they do not work for me.
Noor, let me guess: are you using Windows Notepad to view the file?
I use Notepad++, which understands the Unix-style line endings that np.savetxt() writes by default, even on Windows.
You might want to explicitly specify newline="\r\n" when calling savetxt:
np.savetxt('file1.txt', output, fmt='%10d' ,header= " x y z", newline="\r\n")
Docs: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.savetxt.html
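By default savetxt prefixes the header line with "# " (its comments parameter), which differs from the plain x y z line asked for in the question. A minimal sketch that passes comments='' to drop the prefix; padding the column names to width 10 is my own choice so they line up with fmt='%10d', and an in-memory buffer stands in for file1.txt.

```python
import io
import numpy as np

output = np.array([1, 1, 6]) * np.arange(6)[:, None] + 1

# Pad each column name to the same width as the numbers (assumed layout).
header = " ".join(f"{name:>10}" for name in ("x", "y", "z"))

buf = io.StringIO()  # stand-in for open('file1.txt', 'w')
np.savetxt(buf, output, fmt="%10d", header=header, comments="")

lines = buf.getvalue().splitlines()
print(lines[0])  # header line without the leading "# "
print(lines[1])  # first data row
```

The newline="\r\n" argument can be combined with this if the file has to open cleanly in Windows Notepad.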
I am not sure about your data, but this:
import numpy as np
output = np.array([1,1,6])*np.arange(60)[:,None]+1
print(output)
np.savetxt('file1.txt', output, fmt='%10d' ,header= " x y z")
Produces this output:
# x y z
1 1 1
2 2 7
3 3 13
=== snipped a few lines ===
58 58 343
59 59 349
60 60 355
for me.
For np.arange(1000000) the file is about 32 MB and similarly formatted.
For np.arange(10000000) it is about 322 MB and similarly formatted.
willem-van-onsem: 1+ GB was far closer.
I did not account for the fixed width of 10 characters per number, my bad.
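The size figures can be checked with a little arithmetic: with fmt='%10d', three columns, the default single-space delimiter, and a one-character newline, each row takes 3 × 10 + 2 + 1 = 33 bytes. A rough sketch that measures this on a small array and extrapolates (it assumes no value needs more than 10 characters, which holds for the array in the question):

```python
import io
import numpy as np

rows = 1000
output = np.array([1, 1, 6]) * np.arange(rows)[:, None] + 1

buf = io.StringIO()
np.savetxt(buf, output, fmt="%10d")
bytes_per_row = len(buf.getvalue()) // rows  # ASCII only, so characters == bytes

for n in (1_000_000, 10_000_000, 40_000_000):
    print(n, "rows ->", bytes_per_row * n / 1e6, "MB")
```

For the 40000000 rows mentioned in the question this gives roughly 1.3 GB, or slightly more with newline="\r\n".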
Pandas read_table function is missing some lines in a file I'm trying to read and I can't find out why.
import pandas as pd
import numpy as np
filename = "whatever.txt"
df_pd = pd.read_table(filename, usecols=['FirstColumn'], skip_blank_lines=False)
df_np = np.genfromtxt(filename, usecols=0)
# Count the lines of the file one by one
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1
len_pd = len(df_pd)
len_np = len(df_np)
len_linebyline = file_len(filename)
Unfortunately I can't share my actual data: it is a huge file (30 columns x 58 million rows) and it is protected by licensing. For some reason the numpy and file_len methods give the correct length of ~58 million rows, but the pandas method gives only ~55 million.
Does anyone have any ideas as to what could be causing this or how I could investigate it?
Using the following approach you can try to find the missing data:
In [31]: df = pd.DataFrame({'col':[0,1,2,3,4,6,7,8]})
In [32]: a = np.arange(10)
In [33]: df
Out[33]:
col
0 0
1 1
2 2
3 3
4 4
5 6
6 7
7 8
In [34]: a
Out[34]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [35]: np.setdiff1d(a, df.col)
Out[35]: array([5, 9])
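One cause worth ruling out (an assumption on my part, since the actual data is not shown) is quote characters in the file: by default pandas treats a double quote as the start of a quoted field, so newlines inside a quoted span become part of one value and several file lines collapse into one row. A minimal sketch with made-up data:

```python
import csv
import io
import pandas as pd

# Hypothetical tab-delimited data: 4 data lines, two of which contain a quote character.
text = 'FirstColumn\tSecond\n1\ta\n2\t"b\n3\tc"\n4\td\n'

df_default = pd.read_csv(io.StringIO(text), sep="\t")
df_raw = pd.read_csv(io.StringIO(text), sep="\t", quoting=csv.QUOTE_NONE)

print(len(df_default))  # lines 2 and 3 are merged into one row
print(len(df_raw))      # quotes treated as ordinary characters
```

If the row counts agree once quoting=csv.QUOTE_NONE is passed, stray quotes are the likely explanation; if not, the setdiff1d comparison above still helps locate where the rows go missing.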
I have some files to parse. Each contains a time stamp followed by the labels and values that were modified in that time frame. A very simple example looks like:
Time 1:00
a 1
b 1
c 2
d 4
Time 2:00
d 2
a 4
c 5
e 7
Time 3:00
c 3
Time 4:00
e 3
a 2
b 5
I need to put this into a CSV file so I can plot it afterwards. The CSV file should look like
Time, a, b, c, d, e
1:00, 1, 1, 2, 4, 0
2:00, 4, 1, 5, 4, 7
3:00, 4, 1, 3, 4, 7
4:00, 2, 5, 3, 4, 3
I also need to find the max value of each label so I can sort my graphs.
Since the max values are a:4, b:5, c:5, d:4, e:7, I would like to have a list such as:
['e', 'b', 'c', 'a', 'd' ]
What I am doing now is going through the log once to collect all the labels, since I don't know in advance which labels occur,
and then going through the whole file a second time to parse it. My algorithm is simply:
for label in labelList:
    currentValues[label] = 0
    maxValues[label] = 0
for line in content:
    if endOfCurrentTimeStamp:
        put_current_values_to_CSV()
    else:
        label, value = line.split()
        value = int(value)  # compare numbers, not strings
        currentValues[label] = value
        if maxValues[label] < value:
            maxValues[label] = value
I now have the max value of each label in the maxValues dictionary. What should I do to get a list of labels sorted from max to min value, as described above?
Also, let me know if you can think of an easier way to do the whole thing.
By the way, my data is big: the input file can easily be hundreds of megabytes with thousands of different labels. That is why I write the data to the CSV every time I finish a time frame.
Regards
Dictionaries are not ordered by value, so you will have to convert the data to a different type.
This is probably a little inefficient, but you could try the following:
to_sort = []
for key in maxValues:
    to_sort.append((maxValues[key], key))
to_sort.sort(reverse=True)
Tuples compare element by element, so a list of tuples sorts on the first item (the max value) and falls back to the label on ties; reverse=True gives max-to-min order.
If you would rather sort on one field only, pass operator.itemgetter as the key argument.
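A shorter variant, sketched with the max values quoted in the question, sorts the labels directly and uses each label's value as the sort key:

```python
# Max values taken from the example in the question.
maxValues = {"a": 4, "b": 5, "c": 5, "d": 4, "e": 7}

# Sort the labels by their max value, largest first.
labels_by_max = sorted(maxValues, key=maxValues.get, reverse=True)
print(labels_by_max)  # ['e', 'b', 'c', 'a', 'd']
```

Python's sort is stable, so labels with equal max values keep the order in which they were inserted into the dictionary (guaranteed from Python 3.7 on).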
Use pandas once you've created your CSV. I'll emulate your file with StringIO; you'll feed read_csv a real file name:
import io
import pandas
df = pandas.read_csv(io.StringIO("""Time, a, b, c, d, e
1:00, 1, 1, 2, 4, 0
2:00, 4, 1, 5, 4, 7
3:00, 4, 1, 3, 4, 7
4:00, 2, 5, 3, 4, 3"""), index_col=0, skipinitialspace=True)
df.max().sort_values()
Output:
a 4
d 4
b 5
c 5
e 7
dtype: int64
Plotting is easy too:
df.plot()
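For the parsing step itself, pandas can also carry values forward between time frames. A rough single-pass sketch follows; it holds everything in memory, which may not suit files of hundreds of megabytes, and it assumes that no label is literally called "Time" and that all values are integers.

```python
import io
import pandas as pd

# Stand-in for the log file from the question.
log = io.StringIO("""Time 1:00
a 1
b 1
c 2
d 4
Time 2:00
d 2
a 4
c 5
e 7
Time 3:00
c 3
Time 4:00
e 3
a 2
b 5
""")

times = []    # one entry per "Time" line, in file order
changes = []  # per time frame: {label: value} for the labels seen in it
for line in log:
    parts = line.split()
    if not parts:
        continue
    if parts[0] == "Time":  # assumes no label is literally called "Time"
        times.append(parts[1])
        changes.append({})
    else:
        changes[-1][parts[0]] = int(parts[1])

df = pd.DataFrame(changes, index=times).sort_index(axis=1)
# Carry the last seen value forward; labels not seen yet count as 0.
df = df.ffill().fillna(0).astype(int)
df.index.name = "Time"
print(df.to_csv())
```

Each row then holds the most recent value of every label, and df.max() gives the per-label maxima without a second pass over the file.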
I would like to load a table in numpy, so that the first row and first column would be considered text labels. Something equivalent to this R code:
read.table("filename.txt", row.header=T)
Where the file is a delimited text file like this:
A B C D
X 5 4 3 2
Y 1 0 9 9
Z 8 7 6 5
so that, once it is read in, I will have an array:
[[5,4,3,2],
[1,0,9,9],
[8,7,6,5]]
With some sort of:
rownames ["X","Y","Z"]
colnames ["A","B","C","D"]
Is there such a class / mechanism?
Numpy arrays aren't perfectly suited to table-like structures. However, pandas.DataFrames are.
For what you're wanting, use pandas.
For your example, you'd do
data = pandas.read_csv('filename.txt', sep=r'\s+', index_col=0)
As a more complete example (using StringIO to simulate your file):
from io import StringIO
import pandas as pd
f = StringIO("""A B C D
X 5 4 3 2
Y 1 0 9 9
Z 8 7 6 5""")
# sep=r'\s+' splits on any run of whitespace
x = pd.read_csv(f, sep=r'\s+', index_col=0)
print('The DataFrame:')
print(x)
print('Selecting a column')
print(x['D'])  # or "x.D" if there aren't spaces in the name
print('Selecting a row')
print(x.loc['Y'])
This yields:
The DataFrame:
A B C D
X 5 4 3 2
Y 1 0 9 9
Z 8 7 6 5
Selecting a column
X 2
Y 9
Z 5
Name: D, dtype: int64
Selecting a row
A 1
B 0
C 9
D 9
Name: Y, dtype: int64
Also, as @DSM pointed out, it's very useful to know about things like DataFrame.values or DataFrame.to_records() if you do need a "raw" numpy array. (pandas is built on top of numpy. In a simple, non-strict sense, each column of a DataFrame is stored as a 1D numpy array.)
For example:
In [2]: x.values
Out[2]:
array([[5, 4, 3, 2],
       [1, 0, 9, 9],
       [8, 7, 6, 5]])
In [3]: x.to_records()
Out[3]:
rec.array([('X', 5, 4, 3, 2), ('Y', 1, 0, 9, 9), ('Z', 8, 7, 6, 5)],
dtype=[('index', 'O'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8'), ('D', '<i8')])
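If pandas is not an option, the same three pieces can be pulled apart with plain NumPy. A minimal sketch, assuming whitespace-separated fields and integer values; the names colnames, rownames, and values are only illustrative:

```python
import io
import numpy as np

# Stand-in for filename.txt.
f = io.StringIO("A B C D\nX 5 4 3 2\nY 1 0 9 9\nZ 8 7 6 5\n")

colnames = f.readline().split()               # first line: column labels
body = np.array([line.split() for line in f]) # remaining lines as strings
rownames = body[:, 0].tolist()                # first field of each row: row label
values = body[:, 1:].astype(int)              # the rest: numeric data

print(colnames)  # ['A', 'B', 'C', 'D']
print(rownames)  # ['X', 'Y', 'Z']
print(values)
```

This keeps the labels in separate lists, so label-based lookups need an extra step such as values[rownames.index('Y')].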