Appending columns produces NaN in pandas DataFrame

I need to add columns iteratively to a DataFrame object. This is a simplified version:
>>> x = DataFrame()
>>> for i in 'ps':
...     x = x.append(DataFrame({i: [3, 4]}))
...
>>> x
     p    s
0    3  NaN
1    4  NaN
0  NaN    3
1  NaN    4
What should I do to get:
   p  s
0  3  3
1  4  4
?

Your idea of creating the dict first is probably the best way:
>>> from pandas import *
>>> DataFrame({c: [1,2] for c in 'sp'})
p s
0 1 1
1 2 2
(here using a dictionary comprehension, available since Python 2.7). Just for completeness, though, you could -- inefficiently -- use join or concat to get a column-by-column approach to work:
>>> df = DataFrame()
>>> for c in 'sp':
...     df = concat([df, DataFrame({c: [1,2]})], axis=1)
...
>>> print df
s p
0 1 1
1 2 2
>>>
>>> df = DataFrame()
>>> for c in 'sp':
...     df = df.join(DataFrame({c: [1,2]}), how='outer')
...
>>> print df
s p
0 1 1
1 2 2
[You can see the difference in column order.] But building the dict first and constructing the DataFrame from it in one step is a much better approach.
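For instance, a minimal sketch of that dict-first pattern (looping over the same made-up columns as above):
>>> cols = {}
>>> for c in 'sp':
...     cols[c] = [1, 2]    # collect each column in a plain dict first
...
>>> DataFrame(cols)         # build the frame once, so the row indexes line up
   p  s
0  1  1
1  2  2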

Related

Mapping string list to pandas column?

Just curious: is there a better way to map a pandas column against a list?
ref_list=['a','b','c','d']
lst = [0,2,1]
df = pd.DataFrame(lst,columns=['no'])
Expected output:
   no map
0   0   a
1   2   c
2   1   b
Use map with an enumerated dictionary:
df['map_'] = df['no'].map(dict(enumerate(ref_list)))
#df['map_'] = np.array(ref_list)[lst]
print(df)
no map_
0 0 a
1 2 c
2 1 b
df = pd.DataFrame(zip(lst, np.array(ref_list)[lst]), columns=["no", "map"])
print(df)
Prints:
no map
0 0 a
1 2 c
2 1 b

Pandas DataFrame from a dict of ndarray

Consider a dictionary like the following:
>>> dict_temp = {'a': np.array([[0,1,2], [3,4,5]]),
'b': np.array([[3,4,5], [2,5,1], [5,3,7]])}
How can I build a pandas DataFrame out of this, using a multi-index with level 0 and 1 as follows:
level_0 = ['a', 'b']
level_1 = [[0,1], [0,1,2]]
I expect the code to build the multi-index levels itself... I don't care about the column names for now.
Appreciate comments...
Try concat:
pd.concat({k: pd.DataFrame(d) for k, d in dict_temp.items()})
Output:
     0  1  2
a 0  0  1  2
  1  3  4  5
b 0  3  4  5
  1  2  5  1
  2  5  3  7
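If you also want the index levels to carry names, concat's names parameter should do it (the level names below are just illustrative):
>>> pd.concat({k: pd.DataFrame(d) for k, d in dict_temp.items()},
...           names=['level_0', 'level_1'])
                 0  1  2
level_0 level_1
a       0        0  1  2
        1        3  4  5
b       0        3  4  5
        1        2  5  1
        2        5  3  7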

Creating multiple dataframes from a dictionary in a loop

I have a dictionary like the below
d = {'a':'1,2,3','b':'3,4,5,6'}
I want to create dataframes from it in a loop, such as
a = 1,2,3
b = 3,4,5,6
Creating a single dataframe that can reference dictionary keys such as df['a'] does not work for what I am trying to achieve. Any suggestions?
Try this to get a list of dataframes:
>>> import pandas as pd
>>> import numpy as np
>>> dfs = [pd.DataFrame(np.array(b.split(',')), columns=list(a)) for a,b in d.items()]
gives the following output
>>> dfs[0]
a
0 1
1 2
2 3
>>> dfs[1]
b
0 3
1 4
2 5
3 6
To convert your dictionary into a list of DataFrames, run:
lst = [pd.Series(v.split(','), name=k).to_frame()
       for k, v in d.items()]
Then, for your sample data, lst[0] contains:
a
0 1
1 2
2 3
and lst[1]:
b
0 3
1 4
2 5
3 6
Hope this helps:
dfs = []
for key, value in d.items():
    df = pd.DataFrame(value.split(','), columns=[key])  # split the CSV string into rows
    dfs.append(df)
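If you would rather keep the key-based access mentioned in the question, a dict of DataFrames (instead of a list) is a small variation on the same idea:
dfs = {k: pd.DataFrame(v.split(','), columns=[k]) for k, v in d.items()}
dfs['a']   # the one-column frame built from key 'a'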

How to delete rows from a pandas df

I am trying to remove a row in a pandas df plus the following row. For the df below I want to remove the row when the value in Code is equal to X. But I also want to remove the subsequent row as well.
import pandas as pd
d = ({
'Code' : ['A','A','B','C','X','A','B','A'],
'Int' : [0,1,1,2,3,3,4,5],
})
df = pd.DataFrame(d)
If I use this code it removes the desired row. But I can't use the same for value A as there are other rows that contain A, which are required.
df = df[df.Code != 'X']
So my intended output is:
Code Int
0 A 0
1 A 1
2 B 1
3 C 2
4 B 4
5 A 5
I need something like df = df[df.Code != 'X'] +1
Using shift
df.loc[(df.Code != 'X') & (df.Code.shift() != 'X')]
Out[99]:
Code Int
0 A 0
1 A 1
2 B 1
3 C 2
6 B 4
7 A 5
You need to find the index of the row you want to delete, and then drop that row together with the one after it (this assumes the default integer RangeIndex):
>>> i = df[df.Code == 'X'].index
>>> df = df.drop(i.union(i + 1))
>>> df
Code Int
0 A 0
1 A 1
2 B 1
3 C 2
6 B 4
7 A 5
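If you ever need to drop the marker row plus the next n rows, the shift approach generalizes; here is a hedged sketch (n = 1 reproduces the output above):
n = 1                                     # how many rows after each 'X' to drop too
mask = df.Code.eq('X')                    # flag the 'X' rows themselves
for k in range(1, n + 1):
    mask |= df.Code.shift(k).eq('X')      # flag the k-th row after each 'X'
df = df[~mask]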

Convert floats to ints in Pandas?

I've been working with data imported from a CSV. Pandas changed some columns to float, so now the numbers in these columns get displayed as floating points! However, I need them to be displayed as integers or without comma. Is there a way to convert them to integers or not display the comma?
To modify the float output do this:
df = pd.DataFrame(range(5), columns=['a'])
df.a = df.a.astype(float)
df
Out[33]:
a
0 0.0000000
1 1.0000000
2 2.0000000
3 3.0000000
4 4.0000000
pd.options.display.float_format = '{:,.0f}'.format
df
Out[35]:
a
0 0
1 1
2 2
3 3
4 4
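Note that pd.options.display.float_format changes the display globally. If you only want the formatting for one block of code, pd.option_context limits the scope:
with pd.option_context('display.float_format', '{:,.0f}'.format):
    print(df)   # floats shown without decimals only inside this block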
Use the pandas.DataFrame.astype(<type>) function to manipulate column dtypes.
>>> df = pd.DataFrame(np.random.rand(3,4), columns=list("ABCD"))
>>> df
A B C D
0 0.542447 0.949988 0.669239 0.879887
1 0.068542 0.757775 0.891903 0.384542
2 0.021274 0.587504 0.180426 0.574300
>>> df[list("ABCD")] = df[list("ABCD")].astype(int)
>>> df
A B C D
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
EDIT:
To handle missing values:
>>> df
A B C D
0 0.475103 0.355453 0.66 0.869336
1 0.260395 0.200287 NaN 0.617024
2 0.517692 0.735613 0.18 0.657106
>>> df[list("ABCD")] = df[list("ABCD")].fillna(0.0).astype(int)
>>> df
A B C D
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
Considering the following data frame:
>>> df = pd.DataFrame(10*np.random.rand(3, 4), columns=list("ABCD"))
>>> print(df)
... A B C D
... 0 8.362940 0.354027 1.916283 6.226750
... 1 1.988232 9.003545 9.277504 8.522808
... 2 1.141432 4.935593 2.700118 7.739108
Using a list of column names, change the type for multiple columns with applymap():
>>> cols = ['A', 'B']
>>> df[cols] = df[cols].applymap(np.int64)
>>> print(df)
... A B C D
... 0 8 0 1.916283 6.226750
... 1 1 9 9.277504 8.522808
... 2 1 4 2.700118 7.739108
Or for a single column with apply():
>>> df['C'] = df['C'].apply(np.int64)
>>> print(df)
... A B C D
... 0 8 0 1 6.226750
... 1 1 9 9 8.522808
... 2 1 4 2 7.739108
To convert all float columns to int
>>> df = pd.DataFrame(np.random.rand(5, 4) * 10, columns=list('PQRS'))
>>> print(df)
... P Q R S
... 0 4.395994 0.844292 8.543430 1.933934
... 1 0.311974 9.519054 6.171577 3.859993
... 2 2.056797 0.836150 5.270513 3.224497
... 3 3.919300 8.562298 6.852941 1.415992
... 4 9.958550 9.013425 8.703142 3.588733
>>> float_col = df.select_dtypes(include=['float64']) # This will select float columns only
>>> # list(float_col.columns.values)
>>> for col in float_col.columns.values:
... df[col] = df[col].astype('int64')
>>> print(df)
... P Q R S
... 0 4 0 8 1
... 1 0 9 6 3
... 2 2 0 5 3
... 3 3 8 6 1
... 4 9 9 8 3
This is a quick solution in case you want to convert more columns of your pandas.DataFrame from float to integer, also handling the case where you have NaN values.
cols = ['col_1', 'col_2', 'col_3', 'col_4']
for col in cols:
    df[col] = df[col].apply(lambda x: int(x) if x == x else "")  # x == x is False only for NaN
I tried else x and else None, but the result still contained floats, so I used else "".
Use 'Int64' for NaN support
astype(int) and astype('int64') cannot handle missing values (numpy int)
astype('Int64') (note the capital I) can handle missing values (pandas int)
df['A'] = df['A'].astype('Int64') # capital I
This assumes you want to keep missing values as NaN. If you plan to impute them, you could fillna first as Ryan suggested.
Examples of 'Int64' (capital I)
If the floats are already rounded, just use astype:
df = pd.DataFrame({'A': [99.0, np.nan, 42.0]})
df['A'] = df['A'].astype('Int64')
# A
# 0 99
# 1 <NA>
# 2 42
If the floats are not rounded yet, round before astype:
df = pd.DataFrame({'A': [3.14159, np.nan, 1.61803]})
df['A'] = df['A'].round().astype('Int64')
# A
# 0 3
# 1 <NA>
# 2 2
To read int+NaN data from a file, use dtype='Int64' to avoid the need for converting at all:
import io
csv = io.StringIO('''
id,rating
foo,5
bar,
baz,2
''')
df = pd.read_csv(csv, dtype={'rating': 'Int64'})
# id rating
# 0 foo 5
# 1 bar <NA>
# 2 baz 2
Notes
'Int64' is an alias for Int64Dtype:
df['A'] = df['A'].astype(pd.Int64Dtype()) # same as astype('Int64')
Sized/signed aliases are available:
alias     lower bound                 upper bound
'Int8'    -128                        127
'Int16'   -32,768                     32,767
'Int32'   -2,147,483,648              2,147,483,647
'Int64'   -9,223,372,036,854,775,808  9,223,372,036,854,775,807
'UInt8'   0                           255
'UInt16'  0                           65,535
'UInt32'  0                           4,294,967,295
'UInt64'  0                           18,446,744,073,709,551,615
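As a quick illustration of the sized aliases, here is 'Int8' on values small enough to fit (made-up data):
df = pd.DataFrame({'A': [12.0, np.nan, -7.0]})
df['A'] = df['A'].astype('Int8')   # nullable 8-bit int; values must lie in -128..127
# A
# 0 12
# 1 <NA>
# 2 -7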
Expanding on Ryan G's mention of the pandas.DataFrame.astype(<type>) method: one can use the errors='ignore' argument to convert only those columns that do not produce an error, which notably simplifies the syntax. Obviously, caution should be applied when ignoring errors, but for this task it comes in very handy.
>>> df = pd.DataFrame(np.random.rand(3, 4), columns=list('ABCD'))
>>> df *= 10
>>> print(df)
... A B C D
... 0 2.16861 8.34139 1.83434 6.91706
... 1 5.85938 9.71712 5.53371 4.26542
... 2 0.50112 4.06725 1.99795 4.75698
>>> df['E'] = list('XYZ')
>>> df = df.astype(int, errors='ignore')
>>> print(df)
... A B C D E
... 0 2 8 1 6 X
... 1 5 9 5 4 Y
... 2 0 4 1 4 Z
From pandas.DataFrame.astype docs:
errors : {‘raise’, ‘ignore’}, default ‘raise’
Control raising of exceptions on invalid data for provided dtype.
raise : allow exceptions to be raised
ignore : suppress exceptions. On error return original object
New in version 0.20.0.
The columns that need to be converted to int can also be specified in a dictionary, as below:
df = df.astype({'col1': 'int', 'col2': 'int', 'col3': 'int'})
>>> import pandas as pd
>>> right = pd.DataFrame({'C': [1.002, 2.003], 'D': [1.009, 4.55], 'key': ['K0', 'K1']})
>>> print(right)
C D key
0 1.002 1.009 K0
1 2.003 4.550 K1
>>> right['C'] = right.C.astype(int)
>>> print(right)
C D key
0 1 1.009 K0
1 2 4.550 K1
The text of the question explains that the data comes from a CSV, so I think showing options that make the conversion when the data is read, rather than afterwards, is relevant to the topic.
When importing spreadsheets or CSVs into a dataframe, integer-only columns are commonly converted to float, because Excel stores all numerical values as floats and because of how the underlying libraries work.
When the file is read with read_excel or read_csv, there are a couple of options to avoid the after-import conversion (the first two are illustrated in the sketch after this list):
the dtype parameter accepts a dictionary of column names and target types, like dtype={"my_column": "Int64"}
the converters parameter can be used to pass a function that makes the conversion, for example changing NaN's to 0: converters={"my_column": lambda x: int(x) if x else 0}
the convert_float parameter will convert "integral floats to int (i.e., 1.0 --> 1)", but take care with corner cases like NaN's. This parameter is only available in read_excel
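A minimal sketch of the dtype and converters options on read_csv (the file contents and column names are made up):
import io
import pandas as pd

data = io.StringIO('id,score\nfoo,5\nbar,\nbaz,2\n')
df = pd.read_csv(
    data,
    dtype={'id': 'string'},                              # target type per column
    converters={'score': lambda x: int(x) if x else 0},  # empty field -> 0 while parsing
)
# id score
# 0 foo 5
# 1 bar 0
# 2 baz 2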
To make the conversion in an existing dataframe, several alternatives have been given in other comments, but since v1.0.0 pandas has an interesting function for these cases: convert_dtypes, which "Convert[s] columns to best possible dtypes using dtypes supporting pd.NA."
As example:
In [3]: import numpy as np
In [4]: import pandas as pd
In [5]: df = pd.DataFrame(
...: {
...: "a": pd.Series([1, 2, 3], dtype=np.dtype("int64")),
...: "b": pd.Series([1.0, 2.0, 3.0], dtype=np.dtype("float")),
...: "c": pd.Series([1.0, np.nan, 3.0]),
...: "d": pd.Series([1, np.nan, 3]),
...: }
...: )
In [6]: df
Out[6]:
a b c d
0 1 1.0 1.0 1.0
1 2 2.0 NaN NaN
2 3 3.0 3.0 3.0
In [7]: df.dtypes
Out[7]:
a int64
b float64
c float64
d float64
dtype: object
In [8]: converted = df.convert_dtypes()
In [9]: converted.dtypes
Out[9]:
a Int64
b Int64
c Int64
d Int64
dtype: object
In [10]: converted
Out[10]:
a b c d
0 1 1 1 1
1 2 2 <NA> <NA>
2 3 3 3 3
Although there are many options here, you can also convert the format of specific columns using a dictionary:
Data = pd.read_csv('Your_Data.csv')
Data_2 = Data.astype({"Column_a": "int32", "Column_b": "float64", "Column_c": "int32"})
print(Data_2.dtypes)  # check the dtypes of the columns
This is a useful and very fast way to change the data format of specific columns for quick data analysis.
