Pandas' `to_csv` doesn't behave the same way as printing - python

Consider the following sequence of operations:
Create a data frame with two columns of types int64 and float64
Create a new frame by converting all columns to object
Inspect the new data frame
Persist the new data frame
Expect the second column to be persisted as shown in step 3, i.e. as strings, not as float64
Illustrated below:
# Step 1
df = pd.DataFrame.from_dict({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})
# Step 2
df2 = df.astype(object)
# Step 3
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 a 4 non-null object
1 b 4 non-null object
dtypes: object(2)
memory usage: 192.0+ bytes
# NOTE notice how column `b` is rendered
df2
a b
0 3 1
1 2 500.43
2 1 256.13
3 0 5
# Step 4
df2.to_csv("/tmp/df2", index=False, sep="\t")
Now let us inspect the generated output:
$ cat df2
a b
3 1.0
2 500.43
1 256.13
0 5.0
Notice how column b is persisted: the decimal places are still present for round numbers even though the datatype is object. Why does this happen? What am I missing here?
I'm using Pandas 1.1.2 with Python 3.7.9.

I think 'object' is a NumPy/pandas dtype, not one of the Python data types.
If you run:
type(df2.iloc[0, 1])
before step 4, you will get float, even though the column dtype has already been changed to object.
You can use:
df.to_csv("df.csv", float_format='%g', index=False, sep="\t")
instead of casting in step 2.
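For example, a minimal sketch (reusing the df from step 1) showing that the elements of the object column are still plain Python floats, and that float_format controls how floats are written:
import pandas as pd

df = pd.DataFrame({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})
df2 = df.astype(object)

# The cast changes the column dtype, not the values themselves:
print(type(df2.iloc[0, 1]))   # <class 'float'>

# to_csv writes floats in their default string form, hence "1.0" and "5.0".
# float_format='%g' drops the trailing ".0" for whole numbers:
df.to_csv("df.csv", float_format='%g', index=False, sep="\t")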

I am not great with pandas and still learning. I looked at a few solutions and thought: why not do an apply on the data before we send it to the csv file?
Here's what I did to get the values printed as 1 and 5 instead of 1.0 and 5.0.
values in df are a mix of strings, floats and ints
import pandas as pd

df = pd.DataFrame.from_dict({'a': [3, 2, 1, 's', 't'], 'b': [1, 500.43, 256.13, 5, 'txt']})
df2 = df.astype(object)

def convert(x):
    # coerce every value in the column and return a new Series
    a = []
    for i in x.to_list():
        a.append(coerce(i))
    return pd.Series(a)
    # return pd.Series([str(int(i)) if int(i) == i else i for i in x.to_list()])

def coerce(y):
    # numbers: drop the decimal part when it is zero; anything else: plain string
    try:
        p = float(y)
        q = int(y)
        if p != q:
            return str(p)
        else:
            return str(q)
    except (TypeError, ValueError):
        return str(y)

df2.apply(convert).to_csv("abc.txt", index=False, sep="\t")
Output in the file will be:
a b
3 1
2 500.43
1 256.13
s 5
t txt
all values in df are numeric (integers or floats)
import pandas as pd
df = pd.DataFrame.from_dict({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})
df2 = df.astype(object)
def convert(x):
    return pd.Series([str(int(i)) if int(i) == i else i for i in x.to_list()])
df2.apply(convert).to_csv("abc.txt", index=False, sep="\t")
The output is as follows:
a b
3 1
2 500.43
1 256.13
0 5
Here I am assuming all values in df2 are numeric. If it has a string value, then int(i) will fail.
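As a possible shortcut (just a sketch, assuming the values are either numbers or strings): since the goal is only to drop the trailing ".0" on whole floats, a per-value formatter via applymap (DataFrame.map on newer pandas) may be enough:
import pandas as pd

df = pd.DataFrame.from_dict({'a': [3, 2, 1, 's', 't'], 'b': [1, 500.43, 256.13, 5, 'txt']})

# Format floats with '%g' (drops a trailing '.0'); leave ints and strings untouched.
formatted = df.applymap(lambda v: '%g' % v if isinstance(v, float) else v)
formatted.to_csv("abc.txt", index=False, sep="\t")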


pandas combine nested dataframes into one single dataframe

I have a dataframe where one column (we'll call it info) contains another dataframe in every cell/row. I want to loop through all the rows in this column and literally stack the nested dataframes on top of each other, because they all have the same columns.
How would I go about this?
You could try as follows:
import pandas as pd
length=5
# some dfs
nested_dfs = [pd.DataFrame({'a': [*range(length)],
                            'b': [*range(length)]}) for x in range(length)]
print(nested_dfs[0])
a b
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
# df with nested_dfs in info
df = pd.DataFrame({'info_col': nested_dfs})
# code to be implemented
lst_dfs = df['info_col'].values.tolist()
df_final = pd.concat(lst_dfs,axis=0, ignore_index=True)
df_final.tail()
a b
20 0 0
21 1 1
22 2 2
23 3 3
24 4 4
This method should be a bit faster than the solution offered by nandoquintana, which also works.
Incidentally, it is ill-advised to name a df column info, because df.info is actually a method. E.g., normally df['col_name'].values.tolist() can also be written as df.col_name.values.tolist(). However, if you try this with df.info.values.tolist(), you will run into an error:
AttributeError: 'function' object has no attribute 'values'
You also run the risk of shadowing the method if you start assigning values via attribute access, which is probably not what you want. E.g.:
print(type(df.info))
<class 'method'>
df.info=1
# the column is unaffected; you have just shadowed the method with an int attribute
print(type(df.info))
<class 'int'>
# but:
df['info']=1
# your column now has all 1's
print(type(df['info']))
<class 'pandas.core.series.Series'>
This is the solution that I came up with, although it's not the fastest, which is why I am still leaving the question unanswered:
df1 = pd.DataFrame()
for frame in df['Info'].tolist():
    df1 = pd.concat([df1, frame], axis=0).reset_index(drop=True)
Our dataframe has three columns (col1, col2 and info).
In info, each row has a nested df as value.
import pandas as pd
nested_d1 = {'coln1': [11, 12], 'coln2': [13, 14]}
nested_df1 = pd.DataFrame(data=nested_d1)
nested_d2 = {'coln1': [15, 16], 'coln2': [17, 18]}
nested_df2 = pd.DataFrame(data=nested_d2)
d = {'col1': [1, 2], 'col2': [3, 4], 'info': [nested_df1, nested_df2]}
df = pd.DataFrame(data=d)
We could combine the rows of all nested dfs by appending them to a list (since the nested dfs share the same schema) and concatenating them afterwards.
nested_dfs = []
for index, row in df.iterrows():
    nested_dfs.append(row['info'])
result = pd.concat(nested_dfs, sort=False).reset_index(drop=True)
print(result)
This would be the result:
coln1 coln2
0 11 13
1 12 14
2 15 17
3 16 18
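Both answers boil down to collecting the nested frames into a list and calling pd.concat once. As a minimal sketch on the df from this answer, the column can also be handed to concat directly:
result = pd.concat(df['info'].tolist(), ignore_index=True)
print(result)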

zip two columns of a dataframe without changing the datatype

The problem is simple. Here we have a dataframe with a specified datatype for columns:
df = pd.DataFrame({'A':[1,2], 'B':[3,4]})
df.A = df.A.astype('int16')
#df
A B
0 1 3
1 2 4
#df.dtypes
A int16
B int64
dtype: object
Now I zip two columns A and B into a tuple:
df['C'] = list(zip(df.A, df.B))
A B C
0 1 3 (1, 3)
1 2 4 (2, 4)
However, now the data type of values in column C are changed.
type(df.C[0][0])
#int
type(df.A[0])
#numpy.int16
How can I zip two columns and keep the datatype of each value inside the tuples, so that type(df.C[0][0]) would be int16 (same as type(df.A[0]))?
I think some type conversion is happening when you iterate over df.A, etc. See https://numpy.org/doc/stable/reference/generated/numpy.ndarray.tolist.html:
Return a copy of the array data as a (nested) Python list. Data items
are converted to the nearest compatible builtin Python type, via the
item function.
But this worked
>>> import pandas as pd
>>> df = pd.DataFrame({'A':[1,2], 'B':[3,4]})
>>> df.A = df.A.astype('int16')
>>> df['C'] = list(zip(df.A.values, df.B.values))
>>> df
A B C
0 1 3 (1, 3)
1 2 4 (2, 4)
>>> type(df.C[0][0])
<class 'numpy.int16'>
>>> type(df.C[0][1])
<class 'numpy.int64'>

Python pandas - count elements which have a number in every row [duplicate]

What is the best way to account for (not a number) nan values in a pandas DataFrame?
The following code:
import numpy as np
import pandas as pd
dfd = pd.DataFrame([1, np.nan, 3, 3, 3, np.nan], columns=['a'])
dfv = dfd.a.value_counts().sort_index()
print("nan: %d" % dfv[np.nan].sum())
print("1: %d" % dfv[1].sum())
print("3: %d" % dfv[3].sum())
print("total: %d" % dfv[:].sum())
Outputs:
nan: 0
1: 1
3: 3
total: 4
While the desired output is:
nan: 2
1: 1
3: 3
total: 6
I am using pandas 0.17 with Python 3.5.0 with Anaconda 2.4.0.
To count just null values, you can use isnull():
In [11]:
dfd.isnull().sum()
Out[11]:
a 2
dtype: int64
Here a is the column name, and there are 2 occurrences of the null value in the column.
If you want to count only NaN values in column 'a' of a DataFrame df, use:
len(df) - df['a'].count()
Here count() tells us the number of non-NaN values, and this is subtracted from the total number of values (given by len(df)).
To count NaN values in every column of df, use:
len(df) - df.count()
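For instance, a minimal sketch on the dfd frame from the question (the numbers follow from the data above):
nan_in_a = len(dfd) - dfd['a'].count()    # 2
nan_per_column = len(dfd) - dfd.count()   # Series with 2 for column 'a'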
If you want to use value_counts, tell it not to drop NaN values by setting dropna=False (added in 0.14.1):
dfv = dfd['a'].value_counts(dropna=False)
This allows the missing values in the column to be counted too:
3 3
NaN 2
1 1
Name: a, dtype: int64
The rest of your code should then work as you expect (note that it's not necessary to call sum; just print("nan: %d" % dfv[np.nan]) suffices).
A good clean way to count all NaN's in all columns of your dataframe would be ...
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})
print(df.isna().sum().sum())
Using a single sum, you get the count of NaN's for each column. The second sum sums those column sums.
This one worked for me best!
If you want a simple summary (great in data science for counting missing values and their types), use:
df.info(verbose=True, null_counts=True)
Or another cool one is:
df['<column_name>'].value_counts(dropna=False)
Example:
df = pd.DataFrame({'a': [1, 2, 1, 2, np.nan],
                   'b': [2, 2, np.nan, 1, np.nan],
                   'c': [np.nan, 3, np.nan, 3, np.nan]})
This is the df:
a b c
0 1.0 2.0 NaN
1 2.0 2.0 3.0
2 1.0 NaN NaN
3 2.0 1.0 3.0
4 NaN NaN NaN
Run Info:
df.info(verbose=True, null_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
a 4 non-null float64
b 3 non-null float64
c 2 non-null float64
dtypes: float64(3)
So you see that for column c you get 2 non-nulls out of 5 rows, because there are nulls at rows [0, 2, 4].
And this is what you get using value_counts for each column:
In [17]: df['a'].value_counts(dropna=False)
Out[17]:
2.0 2
1.0 2
NaN 1
Name: a, dtype: int64
In [18]: df['b'].value_counts(dropna=False)
Out[18]:
NaN 2
2.0 2
1.0 1
Name: b, dtype: int64
In [19]: df['c'].value_counts(dropna=False)
Out[19]:
NaN 3
3.0 2
Name: c, dtype: int64
If you only want the summary of null values for each column, use the following code:
df.isnull().sum()
If you want to know the total number of null values in the data frame, use the following code:
df.isnull().sum().sum() # calculate total
Yet another way to count all the nans in a df:
num_nans = df.size - df.count().sum()
Timings:
import timeit
import numpy as np
import pandas as pd
df_scale = 100000
df = pd.DataFrame(
    [[1, np.nan, 100, 63], [2, np.nan, 101, 63], [2, 12, 102, 63],
     [2, 14, 102, 63], [2, 14, 102, 64], [1, np.nan, 200, 63]] * df_scale,
    columns=['group', 'value', 'value2', 'dummy'])
repeat = 3
numbers = 100
setup = """import pandas as pd
from __main__ import df
"""
def timer(statement, _setup=None):
    print(min(
        timeit.Timer(statement, setup=_setup or setup).repeat(
            repeat, numbers)))
timer('df.size - df.count().sum()')
timer('df.isna().sum().sum()')
timer('df.isnull().sum().sum()')
prints:
3.998805362999999
3.7503365439999996
3.689461442999999
so they are pretty much equivalent.
dfd['a'].isnull().value_counts()
returns:
True     695
False     60
Name: a, dtype: int64
True represents the null values count
False represents the non-null values count
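If the True/False labels are confusing, a small sketch that relabels them (assuming the dfd frame from the question):
counts = dfd['a'].isnull().value_counts().rename({True: 'null', False: 'not null'})
print(counts)
# not null    4
# null        2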

Prevent coercion of pandas data frames while indexing and inserting rows

I'm working with individual rows of pandas data frames, but I'm stumbling over coercion issues while indexing and inserting rows. Pandas seems to always want to coerce from a mixed int/float to all-float types, and I can't see any obvious controls on this behaviour.
For example, here is a simple data frame with a as int and b as float:
import pandas as pd
pd.__version__ # '0.25.2'
df = pd.DataFrame({'a': [1], 'b': [2.2]})
print(df)
# a b
# 0 1 2.2
print(df.dtypes)
# a int64
# b float64
# dtype: object
Here is a coercion issue while indexing one row:
print(df.loc[0])
# a 1.0
# b 2.2
# Name: 0, dtype: float64
print(dict(df.loc[0]))
# {'a': 1.0, 'b': 2.2}
And here is a coercion issue while inserting one row:
df.loc[1] = {'a': 5, 'b': 4.4}
print(df)
# a b
# 0 1.0 2.2
# 1 5.0 4.4
print(df.dtypes)
# a float64
# b float64
# dtype: object
In both instances, I want the a column to remain as an integer type, rather than being coerced to a float type.
After some digging, here are some terribly ugly workarounds. (A better answer will be accepted.)
A quirk found here is that non-numeric columns stop coercion, so here is how to index one row to a dict:
dict(df.assign(_='').loc[0].drop('_', axis=0))
# {'a': 1, 'b': 2.2}
And inserting a row can be done by creating a new data frame with one row:
df = df.append(pd.DataFrame({'a': 5, 'b': 4.4}, index=[1]))
print(df)
# a b
# 0 1 2.2
# 1 5 4.4
Both of these tricks are not optimised for large data frames, so I would greatly appreciate a better answer!
Whenever you are getting data from a dataframe, or appending data to a dataframe, and need to keep the data types the same, avoid conversion to other internal structures that are not aware of the required data types.
When you do df.loc[0] it converts to pd.Series,
>>> type(df.loc[0])
<class 'pandas.core.series.Series'>
And a Series can only have a single dtype, thus coercing int to float.
Instead, keep the structure as a pd.DataFrame:
>>> type(df.loc[[0]])
<class 'pandas.core.frame.DataFrame'>
Select the row needed as a frame and then convert it to a dict:
>>> df.loc[[0]].to_dict(orient='records')
[{'a': 1, 'b': 2.2}]
Similarly, to add a new row, use the pd.DataFrame.append function:
>>> df = df.append([{'a': 5, 'b': 4.4}]) # NOTE: To append as a row, use []
a b
0 1 2.2
0 5 4.4
The above will not cause type conversion,
>>> df.dtypes
a int64
b float64
dtype: object
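A small follow-up (just a sketch): if the duplicated 0 in the index bothers you, passing ignore_index=True should give a fresh RangeIndex without changing the dtypes:
>>> df = pd.DataFrame({'a': [1], 'b': [2.2]})
>>> df = df.append([{'a': 5, 'b': 4.4}], ignore_index=True)
>>> df
   a    b
0  1  2.2
1  5  4.4
>>> df.dtypes
a      int64
b    float64
dtype: object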
The root of the problem is that
the indexing of a pandas dataframe returns a pandas Series.
We can see that:
type(df.loc[0])
# pandas.core.series.Series
And a series can only have one dtype, in your case either int64 or float64.
There are two workarounds that come to mind:
print(df.loc[[0]])
# this will return a dataframe instead of series
# so the result will be
# a b
# 0 1 2.2
# but the dictionary is hard to read
print(dict(df.loc[[0]]))
# {'a': 0 1
# Name: a, dtype: int64, 'b': 0 2.2
# Name: b, dtype: float64}
or
print(df.astype(object).loc[0])
# this will change the type of value to object first and then print
# so the result will be
# a 1
# b 2.2
# Name: 0, dtype: object
print(dict(df.astype(object).loc[0]))
# in this way the dictionary is as expected
# {'a': 1, 'b': 2.2}
When you append a dictionary to a dataframe, it will convert the dictionary to a Series first and then append. (So the same problem happens again)
https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L6973
if isinstance(other, dict):
    other = Series(other)
So your workaround is actually a solid one; alternatively we could:
df.append(pd.Series({'a': 5, 'b': 4.4}, dtype=object, name=1))
# a b
# 0 1 2.2
# 1 5 4.4
A different approach with slight data manipulations:
Assume you have a list of dictionaries (or dataframes)
lod=[{'a': [1], 'b': [2.2]}, {'a': [5], 'b': [4.4]}]
where each dictionary represents a row (note that the values are wrapped in lists). Then you can create a dataframe easily via:
pd.concat([pd.DataFrame(dct) for dct in lod])
a b
0 1 2.2
0 5 4.4
and you maintain the types of the columns. See concat
So if you have a dataframe and a list of dicts, you could just use
pd.concat([df] + [pd.DataFrame(dct) for dct in lod])
In the first case, you can work with the nullable integer data type. The Series selection doesn't coerce to float and values are placed in an object container. The dictionary is then properly created, with the underlying value stored as a np.int64.
df = pd.DataFrame({'a': [1], 'b': [2.2]})
df['a'] = df['a'].astype('Int64')
d = dict(df.loc[0])
#{'a': 1, 'b': 2.2}
type(d['a'])
#numpy.int64
With your syntax, this almost works for the second case too, but this upcasts to object, so not great:
df.loc[1] = {'a': 5, 'b': 4.4}
# a b
#0 1 2.2
#1 5 4.4
df.dtypes
#a object
#b float64
#dtype: object
However, we can make a small change to the syntax for adding a row at the end (with a RangeIndex) and now types are dealt with properly.
df = pd.DataFrame({'a': [1], 'b': [2.2]})
df['a'] = df['a'].astype('Int64')
df.loc[df.shape[0], :] = [5, 4.4]
# a b
#0 1 2.2
#1 5 4.4
df.dtypes
#a Int64
#b float64
#dtype: object
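On newer pandas versions, where DataFrame.append has been removed, a concat-based sketch along the same lines (assuming the same df with an Int64 column) also keeps the dtypes:
df = pd.DataFrame({'a': [1], 'b': [2.2]})
df['a'] = df['a'].astype('Int64')
# build the new row as a one-row frame so each column keeps its own dtype
new_row = pd.DataFrame({'a': pd.array([5], dtype='Int64'), 'b': [4.4]})
df = pd.concat([df, new_row], ignore_index=True)
df.dtypes
#a      Int64
#b    float64
#dtype: object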

Append an empty row in dataframe using pandas

I am trying to append an empty row at the end of a dataframe but am unable to do so; I'm also trying to understand how pandas works with the append function, and I'm still not getting it.
Here's the code:
import pandas as pd
excel_names = ["ARMANI+EMPORIO+AR0143-book.xlsx"]
excels = [pd.ExcelFile(name) for name in excel_names]
frames = [x.parse(x.sheet_names[0], header=None,index_col=None).dropna(how='all') for x in excels]
for f in frames:
    f.append(0, float('NaN'))
    f.append(2, float('NaN'))
There are two columns and a random number of rows.
With "print f" in the for loop I get this:
0 1
0 Brand Name Emporio Armani
2 Model number AR0143
4 Part Number AR0143
6 Item Shape Rectangular
8 Dial Window Material Type Mineral
10 Display Type Analogue
12 Clasp Type Buckle
14 Case Material Stainless steel
16 Case Diameter 31 millimetres
18 Band Material Leather
20 Band Length Women's Standard
22 Band Colour Black
24 Dial Colour Black
26 Special Features second-hand
28 Movement Quartz
Add a new pandas.Series using pandas.DataFrame.append().
If you wish to specify the name (AKA the "index") of the new row, use:
df.append(pandas.Series(name='NameOfNewRow'))
If you don't wish to name the new row, use:
df.append(pandas.Series(), ignore_index=True)
where df is your pandas.DataFrame.
You can add it by appending a Series to the dataframe as follows. I am assuming that by blank you mean you want a row containing only NaN.
You can first create a Series object with NaN. Make sure you specify the columns while defining the Series object, via the index parameter.
Then you can append it to the DataFrame. Hope it helps!
from numpy import nan as Nan
import pandas as pd
>>> df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
... 'B': ['B0', 'B1', 'B2', 'B3'],
... 'C': ['C0', 'C1', 'C2', 'C3'],
... 'D': ['D0', 'D1', 'D2', 'D3']},
... index=[0, 1, 2, 3])
>>> s2 = pd.Series([Nan,Nan,Nan,Nan], index=['A', 'B', 'C', 'D'])
>>> result = df1.append(s2)
>>> result
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 NaN NaN NaN NaN
You can add a new series, and name it at the same time. The name will be the index of the new row, and all the values will automatically be NaN.
df.append(pd.Series(name='Afterthought'))
Assuming df is your dataframe,
df_prime = pd.concat([df, pd.DataFrame([[np.nan] * df.shape[1]], columns=df.columns)], ignore_index=True)
where df_prime equals df with an additional last row of NaN's.
Note that pd.concat is slow so if you need this functionality in a loop, it's best to avoid using it.
In that case, assuming your index is incremental, you can use
df.loc[df.iloc[-1].name + 1,:] = np.nan
Append "empty" row to data frame and fill selected cells:
Generate empty data frame (no rows just columns a and b):
import pandas as pd
col_names = ["a","b"]
df = pd.DataFrame(columns = col_names)
Append empty row at the end of the data frame:
df = df.append(pd.Series(), ignore_index = True)
Now fill the empty cell at the end (len(df)-1) of the data frame in column a:
df.loc[[len(df)-1],'a'] = 123
Result:
a b
0 123 NaN
And of course one can iterate over the rows and fill cells:
col_names = ["a","b"]
df = pd.DataFrame(columns = col_names)
for x in range(0,5):
    df = df.append(pd.Series(), ignore_index = True)
    df.loc[[len(df)-1],'a'] = 123
Result:
a b
0 123 NaN
1 123 NaN
2 123 NaN
3 123 NaN
4 123 NaN
The code below worked for me.
df.append(pd.Series([np.nan]), ignore_index = True)
Assuming your df.index is sorted you can use:
df.loc[df.index.max() + 1] = None
It handles well different indexes and column types.
[EDIT] it works with pd.DatetimeIndex if there is a constant frequency, otherwise we must specify the new index exactly e.g:
df.loc[df.index.max() + pd.Timedelta(milliseconds=1)] = None
long example:
df = pd.DataFrame([[pd.Timestamp(12432423), 23, 'text_field']],
                  columns=["timestamp", "speed", "text"],
                  index=pd.DatetimeIndex(start='2111-11-11', freq='ms', periods=1))
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1 entries, 2111-11-11 to 2111-11-11
Freq: L
Data columns (total 3 columns):
timestamp 1 non-null datetime64[ns]
speed 1 non-null int64
text 1 non-null object
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 32.0+ bytes
df.loc[df.index.max() + 1] = None
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2 entries, 2111-11-11 00:00:00 to 2111-11-11 00:00:00.001000
Data columns (total 3 columns):
timestamp 1 non-null datetime64[ns]
speed 1 non-null float64
text 1 non-null object
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 64.0+ bytes
df.head()
timestamp speed text
2111-11-11 00:00:00.000 1970-01-01 00:00:00.012432423 23.0 text_field
2111-11-11 00:00:00.001 NaT NaN NaN
You can also use insert, though note that it adds an empty column rather than a row:
your_dataframe.insert(loc=0, value=np.nan, column="")
where loc is the position at which the new (empty) column is inserted.
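A tiny sketch of what that call actually does (assuming a small two-column frame), just to make the behaviour explicit:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.insert(loc=0, value=np.nan, column="")
# df now has an extra, empty-named column of NaN at position 0;
# the number of rows is unchanged.
print(df.shape)   # (2, 3)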
@Dave Reikher's answer is the best solution.
df.loc[df.iloc[-1].name + 1,:] = np.nan
Here's a similar answer without the NumPy library
df.loc[len(df.index)] = ['' for x in df.columns.values.tolist()]
len(df.index) is the number of rows; with a default RangeIndex it is always 1 more than the last index label.
By using df.loc[len(df.index)] you are selecting the next available index label (row).
In that case df.iloc[-1].name + 1 equals len(df.index).
Instead of using NumPy, you can also use a Python list comprehension:
Create a list from the column names: df.columns.values.tolist()
Create a new list of empty strings '' based on the number of columns.
['' for x in df.columns.values.tolist()]
