Does this occur because there is a NaN? - python

I have a list of floats, and when I try to convert it into a Series or DataFrame:
code
000001.SZ 1.305442
000002.SZ 1.771655
000004.SZ 2.649862
000005.SZ 1.373074
000006.SZ 1.115238
...
601512.SH 16.305734
688123.SH 53.395579
603995.SH 19.598881
688268.SH 70.174454
002972.SZ 19.644900
300811.SZ 24.042762
688078.SH 86.263280
603109.SH NaN
Length: 3753, dtype: float64
df = pd.DataFrame(data=mylist, columns=["std_r_in20days"])
print(df)
s = pd.Series(mylist)
print(s)
The result is:
std_r_in20days
0 NaN
0 code
000001.SZ 1.305442
000002.SZ 1.77...
dtype: object
AttributeError: Can only use .str accessor with string values (i.e. inferred_type is 'string', 'unicode' or 'mixed')
Does this occur because there is a NaN in mylist? If so, how can I fix it? I don't want to delete the rows with NaN; I just want to leave them there.

import numpy as np
import pandas as pd

Ser = pd.DataFrame([[1, 2, 3, 4, None], [2, 3, 4, 5, np.nan]])
Ser = Ser.replace(np.nan, 0)
You can do it like this. There are also other pandas functions for this, such as fillna(). :)

Instead of removing the whole row, you can replace those values with 0 using the pandas function
df.fillna(0)
Also, it is worth performing a null check at the beginning of your script using
df.isna().sum().sum()
This gives you the total number of NaN values in your whole DataFrame.
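A minimal sketch of both steps, on a small made-up frame (the column name is taken from the question; the values are invented):
import numpy as np
import pandas as pd

df = pd.DataFrame({"std_r_in20days": [1.305442, 1.771655, np.nan]})
print(df.isna().sum().sum())  # 1 -> one NaN in the whole frame
df = df.fillna(0)             # replace NaN with 0, keeping every row
print(df)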

Related

Using .str on Pandas Series converts all data to NaN float64 type

I'm trying to use .str on my Pandas series in order to use the string operator methods, but .str converts all my data to NaN with the float64 dtype. The Pandas series is an object dtype to begin with.
Below I show my API series and then try to perform the split method on it.
In [80]:
wellinfo['API'].head()
Out[80]:
0 3501124153
1 3501124154
2 3501124155
3 3501124185
4 3501725290
Name: API, dtype: object
In [81]:
wellinfo['API'].str.split("0")
Out[81]:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
...
1537 NaN
1538 NaN
1539 NaN
1540 NaN
1541 NaN
1542 NaN
Name: API, Length: 1543, dtype: float64
I've skimmed through the Pandas documentation but cannot find out why it is converting everything. I've also tried multiple methods besides the split method with the same results.
Any information is appreciated. Thank you.
The .str accessor only operates on the string values and applies the operation (here, split) to those; for the other values, which are not str, it returns NaN. Use Series.fillna to fall back to the original values:
wellinfo['API'].str.split("0").fillna(wellinfo['API'])
or, to handle the int values, cast to str first:
wellinfo['API'].astype(str).str.split("0")
as suggested by @Mstaino
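A small sketch of this behaviour on a made-up mixed-dtype series (the values are invented):
import pandas as pd

api = pd.Series(["3501124153", 3501124154], dtype=object)  # one str, one int
print(api.str.split("0"))              # the int entry becomes NaN
print(api.astype(str).str.split("0"))  # cast first, so both entries split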

DataFrame from dicts with automatic date parsing

I am creating a Pandas DataFrame from sequence of dicts.
The dicts are large and somewhat heterogeneous.
Some of the fields are dates.
I would like to automatically detect and parse the date fields.
This can be achieved by
df0 = pd.DataFrame.from_dict(dicts)
df0.to_csv('tmp.csv', index=False)
df = pd.read_csv('tmp.csv', parse_dates=True)
I would like to find a more direct way to do this.
Use pd.to_datetime with errors='ignore'.
Only use it on columns of dtype == object, via select_dtypes. This prevents converting numeric columns into nonsensical dates.
'ignore' abandons the conversion attempt for a column if any errors are encountered.
combine_first is used instead of update because update keeps the initial dtypes. Since those were object, that would mess it all up.
df.select_dtypes(include=object).apply(pd.to_datetime, errors='ignore').combine_first(df)
date0 date1 feuxdate notadate
0 2019-01-01 NaT NaN NaN
1 NaT NaT 0.0 NaN
2 NaT NaT NaN hi
3 NaT 2019-02-01 NaN NaN
You could also get tricky with assign to deal with the dtypes:
df.assign(**df.select_dtypes(include=object).apply(pd.to_datetime, errors='ignore'))
Setup
dicts = [
    {'date0': '2019-01-01'},
    {'feuxdate': 0},
    {'notadate': 'hi'},
    {'date1': '20190201'}
]
df = pd.DataFrame.from_dict(dicts)
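To verify what the one-liner produces, a quick sketch assuming the setup above (combine_first orders the columns alphabetically):
import pandas as pd

out = df.select_dtypes(include=object).apply(pd.to_datetime, errors='ignore').combine_first(df)
print(out.dtypes)
# date0     datetime64[ns]
# date1     datetime64[ns]
# feuxdate         float64
# notadate          object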

Find closest valid numbers among missing values in a pandas dataframe

I've got a dataset with multiple missing sequences of varying lengths where I'd like to find the first valid numbers that occur before and after these sequences for some particular dates. In the sample dataset below, I would like to find the valid numbers for ColumnB that occur closest to the date 2018-11-26.
Datasample:
Date ColumnA ColumnB
2018-11-19 107.00 NaN
2018-11-20 104.00 NaN
2018-11-21 106.00 NaN
2018-11-22 105.24 80.00
2018-11-23 104.63 NaN
2018-11-26 104.62 NaN
2018-11-28 104.54 NaN
2018-11-29 103.91 86.88
2018-11-30 103.43 NaN
2018-12-01 106.13 NaN
2018-12-02 110.83 NaN
Expected output:
[80, 86.88]
Some details:
If this particular sequence were the only one with missing values, I would have been able to solve it using for loops, or the pandas functions first_valid_index() or isnull() as described in Pandas - find first non-null value in column, but that will rarely be the case.
I'm able to solve this using a few For Loops, but it's very slow for larger datasets and not very elegant, so I'd really like to hear other suggestions!
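For reference, a minimal reconstruction of the sample frame used by the answers below (an assumption on my part: Date is kept as a regular column, which is how most of the answers address it; the dropna/argmin answer instead assumes Date is the index):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2018-11-19', '2018-11-20', '2018-11-21', '2018-11-22',
                            '2018-11-23', '2018-11-26', '2018-11-28', '2018-11-29',
                            '2018-11-30', '2018-12-01', '2018-12-02']),
    'ColumnA': [107.00, 104.00, 106.00, 105.24, 104.63, 104.62, 104.54,
                103.91, 103.43, 106.13, 110.83],
    'ColumnB': [np.nan, np.nan, np.nan, 80.00, np.nan, np.nan, np.nan,
                86.88, np.nan, np.nan, np.nan],
})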
Try it this way: get the positional index of the date, then slice on each side and take the nearest valid value:
idx = np.where(df['Date'] == '2018-11-26')[0][0]
# idx == 5
num = (df.loc[df.loc[:idx, 'ColumnB'].last_valid_index(), 'ColumnB'],   # nearest valid before
       df.loc[df.loc[idx:, 'ColumnB'].first_valid_index(), 'ColumnB'])  # nearest valid after
num
(80.0, 86.88)
I'd try it this way (note this assumes Date is the index, and it returns the single nearest valid row rather than the values before and after):
import pandas as pd
import numpy as np

df_vld = df.dropna()
idx = np.argmin(abs(df_vld.index - pd.Timestamp(2018, 11, 26)))
# idx == 1
df_vld.loc[df_vld.index[idx]]
Out:
ColumnA 103.91
ColumnB 86.88
Name: 2018-11-29 00:00:00, dtype: float64
[df['ColumnB'].ffill().loc['2018-11-26'], df['ColumnB'].bfill().loc['2018-11-26']]  # assumes Date is the index
You can use ffill and bfill to create two columns holding the nearest values from before and after, such as
df['before'] = df.ColumnB.ffill()
df['after'] = df.ColumnB.bfill()
then get the value for the dates you want with a loc
print(df.loc[df.Date == pd.to_datetime('2018-11-26'), ['before','after']].values[0].tolist())
[80.0, 86.88]
and if you have a list of dates then you can use isin:
list_dates = ['2018-11-26','2018-11-28']
print(df.loc[df.Date.isin(pd.to_datetime(list_dates)), ['before','after']].values.tolist())
[[80.0, 86.88], [80.0, 86.88]]
Here's a way to do it:
t = '2018-11-26'
Look for the index of the date t:
ix = df.loc[df.Date==t].index.values[0]
Keep positions of non-null values in ColumnB:
non_nulls = np.where(~df.ColumnB.isnull())[0]
Get the nearest non-null values both above and below:
[df.loc[non_nulls[non_nulls < ix][-1],'ColumnB']] + [df.loc[non_nulls[non_nulls > ix][0],'ColumnB']]
[80.0, 86.88]

Pandas find max value from series of mixed data

I am using a Pandas DataFrame, and from it I was able to extract a Series named 'xy' that looks like this:
INVOICE
2014-08-14 00:00:00
4557
Printing
nan
Item AMOUNT
nan
1 9.6
0
0
0
9.6
2
11.6
nan
nan
nan
What I need to find is the maximum value, which is usually located towards the end of the 'xy' series. When I tried converting it to string I ran into problems, as some of the entries are strings rather than int or float. I need a robust way to do this, since I am writing this script for several different files.
Try pd.to_numeric:
pd.to_numeric(xy, errors='coerce').max()
4557.0
To get the last float number:
s = pd.to_numeric(xy, errors='coerce')
s.loc[s.last_valid_index()]
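A minimal sketch on a made-up series shaped like the one above:
import pandas as pd

xy = pd.Series(["INVOICE", "2014-08-14 00:00:00", 4557, "Printing",
                float("nan"), 9.6, 2, 11.6])
s = pd.to_numeric(xy, errors="coerce")  # non-numeric entries become NaN
print(s.max())                          # 4557.0
print(s.loc[s.last_valid_index()])      # 11.6, the last parsed number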

How to calculate the mean of a pandas DataFrame with NaN values

I have a DataFrame that looks as follows (the address key is the index):
address date1 date2 date3 date4 date5 date6 date7
<email> NaN NaN NaN 1 NaN NaN NaN
I want to calculate the mean across a row, but when I use DataFrame.mean(axis=1), I get NaN (in the above example, I want a mean of 1). I get NaN even when I use DataFrame.mean(axis=1, skipna=True, numeric_only=True). How can I get the correct mean for the rows in this DataFrame?
Despite appearances, your dtypes are not numeric, hence the NaN values; you need to cast the type using astype:
df['date4'] = df['date4'].astype(int)
Then it will work. Depending on how you loaded/created this data, this is something you should correct at that stage rather than as a post-processing step, if possible.
You can confirm what the dtypes are by looking at the output of df.info(), and you can filter the non-numeric columns out using select_dtypes: df.select_dtypes(include=[np.number]) selects just the numeric columns.
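A minimal sketch of the failure and the fix, with made-up values:
import numpy as np
import pandas as pd

# 'date4' holds the string '1', not the number 1
df = pd.DataFrame({'date1': [np.nan], 'date4': ['1']})
print(df.mean(axis=1, numeric_only=True))  # NaN: the object column is skipped

df['date4'] = df['date4'].astype(int)
print(df.mean(axis=1))                     # 1.0 once the column is numeric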
