Dataframe empty when passing column names - python

I am facing an issue: passing a numpy array to DataFrame without column names initializes it properly, whereas if I pass column names, the result is empty.
x = np.array([(1, '1'), (2, '2')], dtype = 'i4,S1')
df = pd.DataFrame(x)
In []: df
Out[]:
f0 f1
0 1 1
1 2 2
df2 = pd.DataFrame(x, columns=['a', 'b'])
In []: df2
Out[]:
Empty DataFrame
Columns: [a, b]
Index: []

I think you need to specify the column names in the dtype parameter, see DataFrame from structured or record array:
x = np.array([(1, '1'), (2, '2')], dtype=[('a', 'i4'),('b', 'S1')])
df2 = pd.DataFrame(x)
print (df2)
a b
0 1 b'1'
1 2 b'2'
Another solution without parameter dtype:
x = np.array([(1, '1'), (2, '2')])
df2 = pd.DataFrame(x, columns=['a', 'b'])
print (df2)
a b
0 1 1
1 2 2
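If the array already has a structured dtype and rebuilding it is inconvenient, one option is to let pandas pick up the field names and rename the columns afterwards. This is a minimal sketch, assuming the two-field array from the question; the decode step is optional and assumes the bytes are plain ASCII.

```python
import numpy as np
import pandas as pd

# Structured array with auto-generated field names f0 and f1
x = np.array([(1, "1"), (2, "2")], dtype="i4,S1")

# Build the frame first (columns come from the field names), then rename
df = pd.DataFrame(x)
df.columns = ["a", "b"]

# Optional: turn the bytes values (b'1', b'2') into ordinary strings
df["b"] = df["b"].str.decode("ascii")

print(df)
```

Renaming after construction sidesteps the question of how `columns=` interacts with the field names of a structured array.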

The dtype parameter is the cause; without specifying it, passing column names works as expected.
See the example in the DataFrame documentation:
import numpy as np
import pandas as pd
x = np.array([(1, "11"), (2, "22")])
df = pd.DataFrame(x)
print(df)
df2 = pd.DataFrame(x, columns=['a', 'b'])
print(df2)

Related

How to deal with multi-index in pandas dataframe?

I have the following dataframe:
[1]: https://i.stack.imgur.com/3gvRa.png
which has three sub-levels of indices. The first one is a number of 4 digits, the second is a date, and the last one is an index from the original dataframe.
I want to rename each level by ['District_ID', 'Month'] and to drop the third level. I already tried to drop the last one, and I used:
DF.index = DF.index.droplevel(2)
As a result, the third level is gone but the second one duplicates for all the rows. In addition I want to rename the column indices. How can I accomplish these tasks?
I think you have a mistaken impression of what is going on. Take this simple sample frame
midx = pd.MultiIndex.from_tuples(
(lev_0, lev_1, lev_2) for lev_0, lev_1 in zip("ab", "xy") for lev_2 in range(2)
)
df = pd.DataFrame({"col": range(4)}, index=midx)
       col
a x 0    0
    1    1
b y 0    2
    1    3
and look at the result of
print(df.index)
df = df.droplevel(2)
print(df.index)
MultiIndex([('a', 'x', 0),
('a', 'x', 1),
('b', 'y', 0),
('b', 'y', 1)],
)
MultiIndex([('a', 'x'),
('a', 'x'),
('b', 'y'),
('b', 'y')],
)
This should be exactly what you want? If you print the df after the droplevel it looks as if there's something strange happening with the first level, but this is only for making the print clearer.
As for the renaming:
df.index.names = ["lev_0", "lev_1"]
or
df.index.set_names(["lev_0", "lev_1"], inplace=True)
both lead to
             col
lev_0 lev_1
a     x        0
      x        1
b     y        2
      y        3
Or if you want to rename the columns (not clear to me what you are looking for), then you could do
df.columns = ["new_col"]
or
df = df.rename(columns={"col": "new_col"})
     new_col
a x        0
  x        1
b y        2
  y        3
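Applied to the names from the question, dropping the third level and renaming the remaining two can be done in one step. This is only a sketch: the index values below are made up for illustration, and only the level names District_ID and Month come from the question.

```python
import pandas as pd

# Hypothetical three-level index: district number, month, original row index
midx = pd.MultiIndex.from_tuples(
    [(1001, "2020-01", 0), (1001, "2020-01", 1), (1002, "2020-02", 0)]
)
df = pd.DataFrame({"col": range(3)}, index=midx)

# Drop the third level and name the two remaining levels
df.index = df.index.droplevel(2).set_names(["District_ID", "Month"])

print(df.index)
```

The repeated (1001, "2020-01") entries remain in the index after the drop; only the printed frame hides repeated outer labels.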

Make (column id, value) tuples from pandas dataframe values in a list of lists

I'd like to convert every value in a pandas dataframe to a tuple of the form (col_id, val), where col_id is the integer position of the column and val is the value at that location, and output the result as a list of lists that omits the tuples whose val == 0.
Example:
0 1 2 3
document0001 48 0 3 0
document0002 0 4 0 0
Output:
[[(0,48), (2,3)],
[(1,4)]]
I think I can iterate or write a custom function with apply to make the tuples but there has to be a better way.
This does it:
transpose the dataframe, so that each document becomes a column and the original column labels become the index;
iterate over the columns of the transposed dataframe, keeping only the nonzero values of each.
import pandas as pd
import numpy as np
import io
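# Fields are separated by two or more spaces; a raw string avoids invalid-escape warnings
df = pd.read_csv(io.StringIO("""  0  1  2  3
document0001  48  0  3  0
document0002  0  4  0  0"""), sep=r"\s\s+", engine="python")
dft = df.T
l = []
for c in dft.columns:
    # keep the nonzero entries of this document as (column label, value) tuples
    l.append(list(dft.loc[dft[c] != 0, c].to_frame().itertuples(name=None)))
l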
output
[[('0', 48), ('2', 3)], [('1', 4)]]
Read the data:
df = pd.DataFrame([[48, 0, 3, 0], [0, 4, 0, 0]], index=['document0001', 'document0002'], columns=['0', '1', '2', '3'])
Turn rows into columns (transpose):
ndf = df.T
Merge the columns into one value, assuming that in each row of ndf at least one of the two values is zero. This method does not work when both values in a row are nonzero, because the sum would then mix them:
sdf =pd.DataFrame(ndf['document0001']+ndf['document0002']).reset_index(drop=True)
Turn the values in a list of tuples:
sdf[(sdf.T != 0).any()].to_records()
Output:
rec.array([(0, 48), (1, 4), (2, 3)],
dtype=[('index', '<i8'), ('0', '<i8')])
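Neither result above matches the requested shape exactly: the first uses the string column labels rather than integer positions, and the second flattens both documents into one array. A plain list comprehension over the rows gives the requested list of lists. This is a sketch under the assumption that the frame is small enough to iterate row by row in Python.

```python
import pandas as pd

df = pd.DataFrame(
    [[48, 0, 3, 0], [0, 4, 0, 0]],
    index=["document0001", "document0002"],
    columns=["0", "1", "2", "3"],
)

# enumerate() yields the integer position of each column within the row;
# to_numpy().tolist() converts the values to plain Python numbers
result = [
    [(col_id, val) for col_id, val in enumerate(row) if val != 0]
    for row in df.to_numpy().tolist()
]

print(result)  # [[(0, 48), (2, 3)], [(1, 4)]]
```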

Pandas apply function on dataframe over multiple columns

When I run the following code I get a KeyError: ('a', 'occurred at index a'). How can I apply this function, or something similar, over the DataFrame without encountering this issue?
Running python3.6, pandas v0.22.0
import numpy as np
import pandas as pd
def add(a, b):
    return a + b
df = pd.DataFrame(np.random.randn(3, 3),
                  columns=['a', 'b', 'c'])
df.apply(lambda x: add(x['a'], x['c']))
I think you need the parameter axis=1 so that apply processes the frame row by row:
axis: {0 or 'index', 1 or 'columns'}, default 0
0 or index: apply function to each column
1 or columns: apply function to each row
df = df.apply(lambda x: add(x['a'], x['c']), axis=1)
print (df)
0 -0.802652
1 0.145142
2 -1.160743
dtype: float64
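To see where the KeyError comes from, it can help to look at what apply hands to the function for each axis. In this small sketch (the fixed integer data is an assumption made for reproducibility), the lambda returns the name of the Series it receives.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Default axis=0: the function is called once per column, and the Series
# it receives is indexed by row labels, so x['a'] cannot be found
per_column = df.apply(lambda s: s.name)

# axis=1: the function is called once per row, and the Series it receives
# is indexed by column labels, so x['a'] and x['c'] are valid lookups
per_row = df.apply(lambda s: s.name, axis=1)

print(per_column.tolist())  # column names
print(per_row.tolist())     # row labels
```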
You don't even need apply, you can directly add the columns. The output will be a series either way:
df = df['a'] + df['c']
for example:
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
df = df['a'] + df['c']
print(df)
# 0 6
# 1 8
# dtype: int64
you can try this
import numpy as np
import pandas as pd
def add(df):
    return df.a + df.b
df = pd.DataFrame(np.random.randn(3, 3),
                  columns=['a', 'b', 'c'])
df.apply(add, axis=1)
where of course you can substitute any function that takes as inputs the columns of df.

set values in dataframe based on columns in other dataframe

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.randn(5, 3), columns=['X','Y','Z'])
I can easily set the values in df to zero if they are less than a constant:
df[df < 0.0] = 0.0
Can someone tell me how to compare against a column in a different dataframe instead? I assumed this would work, but it does not:
df[df < df2.X] = 0.0
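IIUC you need to use lt and pass axis=0, so that df2.X is aligned on the row index and compared against each column of df (a plain df < df2.X tries to match the Series index against df's column labels instead):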
In [83]:
df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.randn(5, 3), columns=['X','Y','Z'])
df
Out[83]:
A B C
0 2.410659 -1.508592 -1.626923
1 -1.550511 0.983712 -0.021670
2 1.295553 -0.388102 0.091239
3 2.179568 2.266983 0.030463
4 1.413852 -0.109938 1.232334
In [87]:
df2
Out[87]:
X Y Z
0 0.267544 0.355003 -1.478263
1 -1.419736 0.197300 -1.183842
2 0.049764 -0.033631 0.343932
3 -0.863873 -1.361624 -1.043320
4 0.219959 0.560951 1.820347
In [86]:
df[df.lt(df2.X, axis=0)] = 0
df
Out[86]:
A B C
0 2.410659 0.000000 0.000000
1 0.000000 0.983712 -0.021670
2 1.295553 0.000000 0.091239
3 2.179568 2.266983 0.030463
4 1.413852 0.000000 1.232334
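The same comparison can be checked on a small deterministic frame. This sketch uses made-up values, and it uses DataFrame.mask instead of in-place assignment, which is a swapped-in alternative that returns a new frame and leaves df untouched.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [0.5, 0.5, 0.5]})
df2 = pd.DataFrame({"X": [0.0, 1.0, 4.0]})

# axis=0 aligns df2['X'] on the row index, so row i of every column
# is compared against df2['X'][i]
below = df.lt(df2["X"], axis=0)

# Replace the entries that are below the threshold with 0.0
masked = df.mask(below, 0.0)

print(masked)
```

Both frames need matching row labels for this to line up; here both use the default 0, 1, 2 index.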

Map List of Tuples to New Column

Suppose I have a pandas.DataFrame:
In [76]: df
Out[76]:
a b c
0 -0.685397 0.845976 w
1 0.065439 2.642052 x
2 -0.220823 -2.040816 y
3 -1.331632 -0.162705 z
Suppose I have a list of tuples:
In [78]: tp
Out[78]: [('z', 0.25), ('y', 0.33), ('x', 0.5), ('w', 0.75)]
I would like to map tp to df such that the second element in each tuple lands in a new column, in the row whose c value matches the first element of that tuple.
The end result would look like this:
In [87]: df2
Out[87]:
a b c new
0 -0.685397 0.845976 w 0.75
1 0.065439 2.642052 x 0.50
2 -0.220823 -2.040816 y 0.33
3 -1.331632 -0.162705 z 0.25
I've tried using lambdas, pandas.applymap, pandas.map, etc., but cannot seem to crack this one. So, for those who will point out that I have not actually asked a question: how would I map tp to df such that the second element in each tuple lands in a new column, in the row whose c value matches the first element of that tuple?
You need to turn your list of tuples into a dict which is ridiculously easy to do in python, then call map on it:
In [4]:
df['new'] = df['c'].map(dict(tp))
df
Out[4]:
a b c new
index
0 -0.685397 0.845976 w 0.75
1 0.065439 2.642052 x 0.50
2 -0.220823 -2.040816 y 0.33
3 -1.331632 -0.162705 z 0.25
The docs for map show that it accepts a dict, a Series, or a function as its argument.
applymap takes a function as an arg but operates element wise on the whole dataframe which is not what you want to do in this case.
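A self-contained version of the dict lookup, as a sketch: the frame here only keeps column c from the question, and the extra row "q" is an assumption added to show what happens to a value that has no matching tuple.

```python
import pandas as pd

df = pd.DataFrame({"c": ["w", "x", "y", "z", "q"]})
tp = [("z", 0.25), ("y", 0.33), ("x", 0.5), ("w", 0.75)]

# dict(tp) turns the pairs into {'z': 0.25, 'y': 0.33, ...};
# map looks up each value of column c in that dict
df["new"] = df["c"].map(dict(tp))

print(df)
```

Values of c without an entry in the dict come out as NaN rather than raising an error.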
The online docs show how to apply an operation element wise, as does the excellent book
Does this example help?
class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
>>> d = {'col1': ts1, 'col2': ts2}
>>> df = DataFrame(data=d, index=index)
>>> df2 = DataFrame(np.random.randn(10, 5))
>>> df3 = DataFrame(np.random.randn(10, 5),
... columns=['a', 'b', 'c', 'd', 'e'])
