Pandas read_table() missing lines - python

The pandas read_table function is missing some lines of a file I'm trying to read, and I can't figure out why.
import pandas as pd
import numpy as np
filename = "whatever.txt"
df_pd = pd.read_table(filename, usecols=['FirstColumn'], skip_blank_lines=False)
df_np = np.genfromtxt(filename, usecols=0)
# Function to count the file's lines one by one
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1
len_pd = len(df_pd)
len_np = len(df_np)
len_linebyline = file_len(filename)
Unfortunately I can't share my actual data: it's a huge file (30 columns x 58 million rows) and it's protected by licensing. For some reason the numpy and file_len methods both give the correct length of ~58 million rows, but the pandas method returns only ~55 million.
Does anyone have any ideas as to what could be causing this or how I could investigate it?

Using the following approach you can try to find the missing data:
In [31]: df = pd.DataFrame({'col':[0,1,2,3,4,6,7,8]})
In [32]: a = np.arange(10)
In [33]: df
Out[33]:
   col
0    0
1    1
2    2
3    3
4    4
5    6
6    7
7    8
In [34]: a
Out[34]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [35]: np.setdiff1d(a, df.col)
Out[35]: array([5, 9])
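Applied to the question's file, a hedged sketch of the same idea (this assumes FirstColumn holds numeric, unique row identifiers; if it does not, you would have to compare raw line contents instead):
import numpy as np
import pandas as pd

filename = "whatever.txt"  # placeholder name from the question

# Read the same column with both parsers.
pd_vals = pd.read_table(filename, usecols=['FirstColumn'],
                        skip_blank_lines=False)['FirstColumn'].to_numpy()
np_vals = np.genfromtxt(filename, usecols=0, skip_header=1)

# IDs numpy saw but pandas dropped -- inspect those rows in the raw file.
missing = np.setdiff1d(np_vals, pd_vals)
print(len(missing), missing[:20])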

Unique values in pandas

Hi, I've just started learning Python and I am trying to learn pandas. I have a question about how to find the unique start and stop values in a DataFrame. Can someone help me out here?
As you did not provide an example dataset, let's assume this one:
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({'start': np.random.randint(0, 10, 5),
                   'stop': np.random.randint(0, 10, 5),
                   }).T.apply(sorted).T
   start  stop
0      0     5
1      1     8
2      7     9
3      5     6
4      0     9
To get unique values for a given column (here start):
>>> df['start'].unique()
array([0, 1, 7, 5])
For all columns at once:
>>> df.apply(pd.unique, result_type='reduce')
start    [0, 1, 7, 5]
stop     [5, 8, 9, 6]
dtype: object
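If you instead want the unique values pooled across both columns (another way to read the question), a small sketch using the sample frame above:
import numpy as np

# Unique values over the start and stop columns combined.
combined = np.unique(df[['start', 'stop']].values.ravel())
print(combined)
# [0 1 5 6 7 8 9]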

Converting a list with no tuples into a data frame

Normally when you want to turn a set of data into a DataFrame, you make a list for each column, create a dictionary from those lists, then create the DataFrame from the dictionary.
The DataFrame I want to create has 75 columns, all with the same number of rows. Defining the lists one by one isn't going to work. Instead I decided to make a single list and iteratively put a certain chunk of it into each column of the DataFrame.
Here I will make an example where I turn a list into a data frame:
lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Example list
df =
   a  b  c  d  e
0  0  2  4  6  8
1  1  3  5  7  9
# Result I want from the example list
Here is my test code:
import pandas as pd
import numpy as np
dict = {'a': [], 'b': [], 'c': [], 'd': [], 'e': []}
df = pd.DataFrame(dict)
# Here is my test data frame; it contains 5 columns and no rows.
lst = np.arange(10).tolist()
# This is my test list; it looks like lst = [0, 1, ..., 9]
for i in range(len(lst)):
    df.iloc[:, i] = df.iloc[:, i]\
        .append(pd.Series(lst[2 * i:2 * i + 2]))
# This code is supposed to put two entries per column for the whole data frame.
# For the first column, i = 0, so [2 * (0):2 * (0) + 2] = [0:2]
# df.iloc[:, 0] = lst[0:2], so df.iloc[:, 0] = [0, 1]
# For the second column, i = 1, so [2 * (1):2 * (1) + 2] = [2:4]
# df.iloc[:, 1] = lst[2:4], so df.iloc[:, 1] = [2, 3]
# This is how the code was supposed to allocate lst to df.
# However, it raises an error.
When I run this code I get this error:
ValueError: cannot reindex from a duplicate axis
When I add ignore_index = True such that I have
for i in range(len(lst)):
    df.iloc[:, i] = df.iloc[:, i]\
        .append(pd.Series(lst[2 * i:2 * i + 2]), ignore_index=True)
I get this error:
IndexError: single positional indexer is out-of-bounds
After running the code, I check the contents of df. The output is the same whether or not I pass ignore_index.
In: df
Out:
   a   b   c   d   e
0  0 NaN NaN NaN NaN
1  1 NaN NaN NaN NaN
It seems that the first loop runs fine, but the error occurs when trying to fill the second column.
Does anybody know how to get this to work? Thank you.
IIUC:
lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
alst = np.array(lst)
df = pd.DataFrame(alst.reshape(2, -1, order='F'), columns=[*'abcde'])
print(df)
Output:
   a  b  c  d  e
0  0  2  4  6  8
1  1  3  5  7  9
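The key is order='F' (column-major order), which makes reshape fill the array column by column, so consecutive chunks of the list land in consecutive columns. An equivalent sketch without the reshape, slicing the list into the dictionary directly (the column names are just the ones from the example):
import pandas as pd

lst = list(range(10))
n_rows = 2
# Slice consecutive chunks of the list into one column each.
data = {col: lst[i * n_rows:(i + 1) * n_rows]
        for i, col in enumerate('abcde')}
df = pd.DataFrame(data)
print(df)
#    a  b  c  d  e
# 0  0  2  4  6  8
# 1  1  3  5  7  9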

Python Dask map_partitions

Probably a continuation of this question, working from the dask docs examples for map_partitions.
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1., 2., 3., 4., 5.]})
ddf = dd.from_pandas(df, npartitions=2)
from random import randint

def myadd(df):
    new_value = df.x + randint(1, 4)
    return new_value
res = ddf.map_partitions(lambda df: df.assign(z=myadd)).compute()
res
In the above code, randint is only being called once, not once per row as I would expect. How come?
Output:
   x    y  z
0  1  1.0  4
1  2  2.0  5
2  3  3.0  6
3  4  4.0  7
4  5  5.0  8
If you performed the same operation (df.x + randint(1, 4)) on the original pandas dataframe, you would get only one random number, added to every value of the column. This is doing exactly the same thing as the pandas case, except that it is called once for each partition: that is what map_partitions does.
If you wanted a new random number for every row, you should first think of how you would achieve this with pandas. I can immediately think of two:
df.x.map(lambda x: x + random.randint(1, 4))
or
df.x + np.random.randint(1, 4, size=len(df.x))
If you replace your new_value = ... line with one of these, it will work as expected.
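Putting that together with the original snippet (a sketch using the vectorised np.random.randint variant from above):
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1., 2., 3., 4., 5.]})
ddf = dd.from_pandas(df, npartitions=2)

def myadd(df):
    # One random draw per row instead of one per partition.
    return df.x + np.random.randint(1, 4, size=len(df.x))

res = ddf.map_partitions(lambda df: df.assign(z=myadd)).compute()
print(res)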

index into list of pandas series

I have a list of pandas Series objects obj and a Series of indices idx. What I want is a new Series out that, for each row, contains the value obj[idx[row]][row] if idx[row] is not 255, and -1 otherwise.
The following code does what I want to achieve, but I'd like to know if there's a better way of doing this, especially without the overhead of first creating a Python list and then converting that into a pandas series.
>>> import pandas as pd
>>> obj = [pd.Series([1, 2, 3]), pd.Series([4, 5, 6]), pd.Series([7, 8, 9])]
>>> idx = pd.Series([0, 255, 2])
>>> out = pd.Series([obj[idx[row]][row] if idx[row] != 255 else -1 for row in range(len(idx))])
>>> out
0 1
1 -1
2 9
dtype: int64
>>>
Thanks in advance.
Using reindex + lookup:
pd.Series(pd.concat(obj, axis=1).T.reindex(idx).lookup(idx, idx.index)).fillna(-1)
Out[822]:
0    1.0
1   -1.0
2    9.0
dtype: float64
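Note that DataFrame.lookup was deprecated in pandas 1.2 and later removed, and the concat-based one-liner returns floats because of the NaN placeholder row. A NumPy-based sketch that avoids both (assuming, as in the example, that all the Series share the same positional index):
import numpy as np
import pandas as pd

obj = [pd.Series([1, 2, 3]), pd.Series([4, 5, 6]), pd.Series([7, 8, 9])]
idx = pd.Series([0, 255, 2])

# Stack the series into a (n_series, n_rows) array, then take
# arr[idx[row], row] for each row, masking the 255 sentinel.
arr = np.vstack(obj)
rows = np.arange(len(idx))
valid = idx.to_numpy() != 255
out = np.full(len(idx), -1, dtype=arr.dtype)
out[valid] = arr[idx.to_numpy()[valid], rows[valid]]
out = pd.Series(out)
print(out)
# 0    1
# 1   -1
# 2    9
# dtype: int64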

Import .dat file in Python 3

I would like to import a .dat file which contains a text line, then a header, then the numbers, then another text line, something like this example:
start using data to calculate something
x y z g h
1 4 6 8 3
4 5 6 8 9
2 3 6 8 5
end the data that I should import.
Now I am trying to read this file, remove the first and last lines, put the numbers into an array, and do some basic calculations on them, but I could not get rid of the text lines. I used data = np.genfromtxt('sample.dat') to import the data, but with the text lines present it fails. Can anyone help me?
Maybe this helps you:
import numpy as np

data = np.genfromtxt('sample.dat',
                     skip_header=1,
                     skip_footer=1,
                     names=True,
                     dtype=None,
                     delimiter=' ')
print(data)
# Output: [(1, 4, 6, 8, 3) (4, 5, 6, 8, 9) (2, 3, 6, 8, 5)]
Please refer to the numpy documentation for further information about the parameters used: https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html
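Because names=True was passed, the structured array returned by the snippet above can be indexed by column name, for example:
# Continuing from the data array read above.
print(data['x'])         # [1 4 2]
print(data['g'])         # [8 8 8]
print(data['y'].mean())  # 4.0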
