Pandas rounds int64 numbers when loading dictionaries - python

I am loading a list of dictionaries into a pandas dataframe, i.e. if d is my list of dicts, simply:
pd.DataFrame(d)
Unfortunately, one of the values in each dictionary is a 64-bit integer, and it gets converted to float: some dictionaries don't have a value for this column, so those rows are filled with NaN, which forces the entire column to float.
For example:
col1
0 NaN
1 NaN
2 NaN
3 0.000000e+00
4 1.506758e+18
5 1.508758e+18
If I fillna the NaNs to zero and then recast the column with astype(np.int64), the values come back slightly off (due to float rounding). How can I avoid this and keep my original 64-bit values intact?
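The rounding is inherent to float64 itself: it has a 53-bit mantissa, so integers above 2**53 cannot all be represented exactly. A quick illustration of the precision loss (not from the original post):
n = 1506758000000000001
int(float(n))  # -> 1506758000000000000; the trailing 1 is lost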

Demo:
In [10]: d
Out[10]: {'a': [1506758000000000000, nan, 1508758000000000000]}
Naive approach:
In [11]: pd.DataFrame(d)
Out[11]:
a
0 1.506758e+18
1 NaN
2 1.508758e+18
Workaround (note the dtype=str):
In [12]: pd.DataFrame(d, dtype=str).fillna(0).astype(np.int64)
Out[12]:
a
0 1506758000000000000
1 0
2 1508758000000000000

To my knowledge there is no way to override the inference here; you will need to fill in the missing values before passing the data to pandas. Something like this:
d = [{'col1': 1}, {'col2': 2}]
cols_to_check = ['col1']
for row in d:
    for col in cols_to_check:
        if col not in row:
            row[col] = 0
d
Out[39]: [{'col1': 1}, {'col1': 0, 'col2': 2}]
pd.DataFrame(d)
Out[40]:
col1 col2
0 1 NaN
1 0 2.0

You can create a Series with a dict comprehension, then unstack with the fill_value parameter:
pd.Series(
    {(i, j): v for i, x in enumerate(d)
     for j, v in x.items()},
    dtype=np.int64
).unstack(fill_value=0)
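Another option, if your pandas is new enough to have the nullable integer extension dtype (a sketch assuming pandas >= 1.0): build the column as Int64 so the missing value no longer forces a float conversion:
pd.DataFrame({'a': pd.array([1506758000000000000, None, 1508758000000000000],
                            dtype='Int64')})
#                      a
# 0  1506758000000000000
# 1                 <NA>
# 2  1508758000000000000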

Related

pandas: cannot set column with substring extracted from other column

I'm doing something wrong when attempting to set a column for a masked subset of rows to the substring extracted from another column.
Here is some example code that illustrates the problem I am facing:
import pandas as pd
data = [
    {'type': 'A', 'base_col': 'key=val'},
    {'type': 'B', 'base_col': 'other_val'},
    {'type': 'A', 'base_col': 'key=val'},
    {'type': 'B', 'base_col': 'other_val'}
]
df = pd.DataFrame(data)
mask = df['type'] == 'A'
df.loc[mask, 'derived_col'] = df[mask]['base_col'].str.extract(r'key=(.*)')
print("df:")
print(df)
print("mask:")
print(mask)
print("extraction:")
print(df[mask]['base_col'].str.extract(r'key=(.*)'))
The output I get from the above code is as follows:
df:
type base_col derived_col
0 A key=val NaN
1 B other_val NaN
2 A key=val NaN
3 B other_val NaN
mask:
0 True
1 False
2 True
3 False
Name: type, dtype: bool
extraction:
0
0 val
2 val
The boolean mask is as I expect and the extracted substrings on the subset of rows (indexes 0, 2) are also as I expect yet the new derived_col comes out as all NaN. The output I would expect in the derived_col would be 'val' for indexes 0 and 2, and NaN for the other two rows.
Please clarify what I am getting wrong here. Thanks!
You should assign a Series, not a DataFrame: str.extract returns a DataFrame here, so you need to pick column 0:
mask = df['type'] == 'A'
df.loc[mask, 'derived_col'] = df[mask]['base_col'].str.extract(r'key=(.*)')[0]
df
Out[449]:
type base_col derived_col
0 A key=val val
1 B other_val NaN
2 A key=val val
3 B other_val NaN
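An alternative is to ask str.extract for a Series in the first place: with a single capture group, the documented expand=False parameter returns a Series rather than a DataFrame, so there is no column 0 to pick:
df.loc[mask, 'derived_col'] = df.loc[mask, 'base_col'].str.extract(r'key=(.*)', expand=False)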

Reshaping DataFrame with pandas

So I'm working with pandas in Python. I collect data indexed by timestamps in multiple ways.
This means one index can have just 2 features available (with NaN for the others, which is normal) or all of them; it depends.
So my problem is when I add data with multiple values for the same index, see the example below:
Imagine this is the set we're adding new data to:
Index col1 col2
1 a A
2 b B
3 c C
This is the data we will add:
Index new col
1 z
1 y
Then the result is this:
Index col1 col2 new col
1 a A NaN
1 NaN NaN z
1 NaN NaN y
2 b B NaN
3 c C NaN
So instead of that, I would like the result to be :
Index col1 col2 new col1 new col2
1 a A z y
2 b B NaN NaN
3 c C NaN NaN
In other words, instead of multiple rows per index in one column, I want one row per index with multiple columns.
Another way to say it: I want the number of values per timestamp to equal the number of feature columns, not the number of index rows.
This solution assumes the data that you need to add is a series.
Original df:
df = pd.DataFrame(np.random.randint(0,3,size=(3,3)),columns = list('ABC'),index = [1,2,3])
Data to add (series):
s = pd.Series(['x','y'],index = [1,1])
Solution:
df.join(s.to_frame()
         .assign(cc=lambda x: x.groupby(level=0)
                               .cumcount().add(1))
         .set_index('cc', append=True)[0]
         .unstack()
         .rename('New Col{}'.format, axis=1))
Output:
A B C New Col1 New Col2
1 1 2 2 x y
2 0 1 2 NaN NaN
3 2 2 0 NaN NaN
Alternative answer (simpler, though probably less pythonic). I think you need to look at converting wide data to long data and back again in general (pivot and transpose are good things to look up for this), but I also think there are some possible problems in your question: you don't mention new col1 and new col2 in the declaration of the subsequent arrays.
Here's my declarations of your data frames:
d = {'index': [1, 2, 3],'col1': ['a', 'b', 'c'], 'col2': ['A', 'B', 'C']}
df = pd.DataFrame(data=d)
e1 = {'index': [1], 'new col1': ['z']}
dfe1 = pd.DataFrame(data=e1)
e2 = {'index': [1], 'new col2': ['y']}
dfe2 = pd.DataFrame(data=e2)
They look like this:
index new col1
1 z
and this:
index new col2
1 y
Notice that I declare your new columns as part of the data frames. Once they're declared like that, it's just a matter of merging:
dfr1 = pd.merge(df, dfe1, on='index', how="outer")
dfr2 = pd.merge(dfr1, dfe2, on='index', how="outer")
And the output looks like this:
index col1 col2 new col1 new col2
1 a A z y
2 b B NaN NaN
3 c C NaN NaN
I think one problem may arise in the way you first create your second data frame.
Actually, expanding the number of columns based on the content is what makes this reformatting a bit awkward here (as you saw for yourself when you had to write two new column names from the bare assumption that this reflects the number of features observed at every timestamp).
Here is yet another solution; it tries to be a bit more explicit about the steps taken than rhug123's answer.
# Initial dataFrames
a = pd.DataFrame({'col1':['a', 'b', 'c'], 'col2':['A', 'B', 'C']}, index=range(1, 4))
b = pd.DataFrame({'new col':['z', 'y']}, index=[1, 1])
Now the only important step is basically transposing your second DataFrame; here you also need to introduce two new column names.
We will do this by grouping the second dataframe by index and collecting its content ('z', 'y', ...):
c = b.groupby(b.index)['new col'].apply(list) # this has also one index per timestamp, but all features are grouped in a list
# New column names:
cols = ['New col%d' % (k + 1) for k in range(b.index.value_counts().max())]  # widest timestamp decides the column count
# Expanding dataframe "c" for each new column
d = pd.DataFrame(c.to_list(), index=b.index.unique(), columns=cols)
# Merge
a.join(d, how='outer')
Output:
col1 col2 New col1 New col2
1 a A z y
2 b B NaN NaN
3 c C NaN NaN
Finally, one problem with both my answer and the one from rhug123 is that, as written, they won't deal with another feature at a different timestamp correctly. Not sure what the OP expects here.
For example if b is:
new col
1 z
1 y
2 x
The merged output will be:
col1 col2 New col1 New col2
1 a A z y
2 b B x None
3 c C NaN NaN
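As a variant (my own sketch, not from either answer), you can size the new columns from the data itself, so the index-[1, 1, 2] case above also works without hand-written column names:
wide = (b.groupby(level=0)['new col']
         .apply(list)          # one list of features per timestamp
         .apply(pd.Series)     # expand each list into columns 0, 1, ... with NaN padding
         .rename(columns=lambda k: 'New col%d' % (k + 1)))
a.join(wide, how='outer')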

Get list of types in dataframe columns, skipping NaN cells

Let's say I have this dataframe:
df
col1 col2 col3 col4
1 apple NaN apple
2 NaN False 1.3
NaN orange True NaN
I'd like to get a list of all types in each column, excluding the NaN/null cells. The output could be a dictionary like this:
{'col1': int, 'col2': str, 'col3':bool, 'col4': [str,float]}
I've gotten as far as creating a dictionary that outputs all the types in each column, including the NaN values. I'm not sure how to exclude the NaNs.
output = {}
for col in df.columns.values.tolist():
    list_types = [x.__name__ for x in df[col].apply(type).unique()]
    output[col] = list_types
The code above would get me almost what I want, but with a bunch of extra "float"s for the NaNs:
{'col1': [int,float], 'col2': [str,float], 'col3':[bool,float], 'col4': [str,float]}
For excluding nan, do
df = df.dropna()
Then for getting data types:
df.dtypes
In the approach below, I extract the non-NaN items into a list and then take the type of the first remaining item:
# initialize the empty output dict
output = {}
# loop over the columns
for column in df:
    a = [x for x in df[column] if str(x) != 'nan']
    output[column] = type(a[0])
Try with stack, which will drop the NaN; then we do groupby + unique:
df.stack().apply(lambda x : type(x).__name__).groupby(level=1).unique().to_dict()
{'col1': array(['float'], dtype=object), 'col2': array(['str'], dtype=object), 'col3': array(['bool'], dtype=object), 'col4': array(['str'], dtype=object)}
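For the output the question actually asks for, a per-column dropna inside a dict comprehension is a reasonable sketch. Note, though, that a column that contained NaN has usually already been upcast to float, so col1 will still report float rather than the original int:
{col: df[col].dropna().map(type).unique().tolist() for col in df.columns}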

Sum columns in a pandas dataframe which contain a string

I am trying to do something relatively simple: summing all columns in a pandas dataframe whose names contain a certain string, then making that sum a new column in the dataframe. These columns are all numeric float values...
I can get the list of columns which contain the string I want
StmCol = [col for col in cdf.columns if 'Stm_Rate' in col]
But when I try to sum them using:
cdf['PadStm'] = cdf[StmCol].sum()
I get a new column full of "nan" values.
You need to pass axis=1 to .sum; by default (axis=0) it sums over each column:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: df
Out[12]:
A B
0 1 2
1 3 4
In [13]: df[["A"]].sum() # Here I'm passing the list of columns ["A"]
Out[13]:
A 4
dtype: int64
In [14]: df[["A"]].sum(axis=1)
Out[14]:
0 1
1 3
dtype: int64
Only the latter matches the index of df:
In [15]: df["C"] = df[["A"]].sum()
In [16]: df["D"] = df[["A"]].sum(axis=1)
In [17]: df
Out[17]:
A B C D
0 1 2 NaN 1
1 3 4 NaN 3
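So, applied to the original question, the fix would presumably be:
cdf['PadStm'] = cdf[StmCol].sum(axis=1)  # row-wise sum over the matched columns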

returning aggregated dataframe from pandas groupby

I'm trying to wrap my head around pandas groupby methods. I'd like to write a function that does some aggregation and then returns a pandas DataFrame. Here's a grossly simplified example using sum(). I know there are easier ways to do simple sums; in real life my function is more complex:
import pandas as pd
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B'], 'col2':[1.0, 2, 3, 4]})
In [3]: df
Out[3]:
col1 col2
0 A 1
1 A 2
2 B 3
3 B 4
def func2(df):
    dfout = pd.DataFrame({'col1': df['col1'].unique(),
                          'someData': sum(df['col2'])})
    return dfout
t = df.groupby('col1').apply(func2)
In [6]: t
Out[6]:
col1 someData
col1
A 0 A 3
B 0 B 7
I did not expect to have col1 in there twice, nor did I expect that mystery index-looking thing. I really thought I would just get col1 & someData.
In my real life application I'm grouping by more than one column and really would like to get back a DataFrame and not a Series object.
Any ideas for a solution or an explanation on what Pandas is doing in my example above?
----- added info -----
I should have started with this example, I think:
In [13]: import pandas as pd
In [14]: df = pd.DataFrame({'col1':['A','A','A','B','B','B'], 'col2':['C','D','D','D','C','C'], 'col3':[.1,.2,.4,.6,.8,1]})
In [15]: df
Out[15]:
col1 col2 col3
0 A C 0.1
1 A D 0.2
2 A D 0.4
3 B D 0.6
4 B C 0.8
5 B C 1.0
In [16]: def func3(df):
   ....:     dfout = sum(df['col3']**2)
   ....:     return dfout
   ....:
In [17]: t = df.groupby(['col1', 'col2']).apply(func3)
In [18]: t
Out[18]:
col1 col2
A C 0.01
D 0.20
B C 1.64
D 0.36
In the above illustration the result of the apply() function is a pandas Series, and the groupby columns from df.groupby appear only as its index, not as regular columns. The essence of what I'm struggling with is: how do I create a function to apply to a groupby that returns both the result of the function AND the columns on which it was grouped?
----- yet another update ------
It appears that if I then do this:
pd.DataFrame(t).reset_index()
I get back a dataframe which is really close to what I was after.
The reason you are seeing the rows with 0s is that the output of .unique() is an array, so the DataFrame built inside func2 gets its own fresh 0-based index, which apply then nests under the group labels.
The best way to understand how your apply is going to work is to inspect each action group-wise:
In [11]: g = df.groupby('col1')
In [12]: g.get_group('A')
Out[12]:
col1 col2
0 A 1
1 A 2
In [13]: g.get_group('A')['col1'].unique()
Out[13]: array(['A'], dtype=object)
In [14]: sum(g.get_group('A')['col2'])
Out[14]: 3.0
The majority of the time you want this to be an aggregated value.
The output of grouped.apply will always have the group labels as an index (the unique values of 'col1'), so your example construction of col1 seems a little obtuse to me.
Note: to pop 'col1' (the index) back to a column you can call reset_index, so in this case:
In [15]: g.sum().reset_index()
Out[15]:
col1 col2
0 A 3
1 B 7
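The same trick works for your second (func3) example: since t there is a Series, reset_index can also name the value column (name= is a documented Series.reset_index argument):
t.reset_index(name='someData')
#   col1 col2  someData
# 0    A    C      0.01
# 1    A    D      0.20
# 2    B    C      1.64
# 3    B    D      0.36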
