I'm trying to use .str on my Pandas Series in order to use the string methods, but .str converts all my data to NaN with float64 dtype. The Series is object dtype to begin with.
Below I show my API series and then try to call the split method on it.
In [80]:
wellinfo['API'].head()
Out[80]:
0 3501124153
1 3501124154
2 3501124155
3 3501124185
4 3501725290
Name: API, dtype: object
In [81]:
wellinfo['API'].str.split("0")
Out[81]:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
...
1537 NaN
1538 NaN
1539 NaN
1540 NaN
1541 NaN
1542 NaN
Name: API, Length: 1543, dtype: float64
I've skimmed through the Pandas documentation but cannot find out why it is converting everything. I've also tried multiple methods besides the split method with the same results.
Any information is appreciated. Thank you.
The .str accessor only operates on values that are actually strings; for every other value (here the integers stored in the object column) it returns NaN. Either keep the split results where they exist and fill the NaN rows back in with the original values using Series.fillna:
wellinfo['API'].str.split("0").fillna(wellinfo['API'])
or cast everything to string first so the integer values get split as well:
wellinfo['API'].astype(str).str.split("0")
as suggested by #Mstaino
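For reference, a minimal sketch with made-up values that reproduces the behaviour described above and the astype(str) fix:
import pandas as pd

# object-dtype series that mixes integers and strings (made-up values)
api = pd.Series([3501124153, 3501124154, '3501124155'], dtype=object)

print(api.str.split('0'))               # NaN for the int rows, a list only for the str row
print(api.astype(str).str.split('0'))   # every row is split once cast to str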
I am creating a Pandas DataFrame from sequence of dicts.
The dicts are large and somewhat heterogeneous.
Some of the fields are dates.
I would like to automatically detect and parse the date fields.
This can be achieved by
df0 = pd.DataFrame.from_dict(dicts)
df0.to_csv('tmp.csv', index=False)
df = pd.read_csv('tmp.csv', parse_dates=True)
I would like to find a more direct way to do this.
Use pd.to_datetime with errors='ignore'.
Only apply it to columns of dtype == object, selected with select_dtypes. This prevents converting numeric columns into nonsensical dates.
'ignore' abandons the conversion attempt for a column if any errors are encountered, leaving that column unchanged.
combine_first is used instead of update because update keeps the initial dtypes; since those were object, this would mess it all up.
df.select_dtypes(include=object).apply(pd.to_datetime, errors='ignore').combine_first(df)
date0 date1 feuxdate notadate
0 2019-01-01 NaT NaN NaN
1 NaT NaT 0.0 NaN
2 NaT NaT NaN hi
3 NaT 2019-02-01 NaN NaN
You could also get tricky and use assign to deal with the dtypes:
df.assign(**df.select_dtypes(include=object).apply(pd.to_datetime, errors='ignore'))
Setup
dicts = [
{'date0': '2019-01-01'},
{'feuxdate': 0},
{'notadate': 'hi'},
{'date1': '20190201'}
]
df = pd.DataFrame.from_dict(dicts)
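Putting the setup together with the conversion above, a quick sketch to verify which columns actually end up as datetimes:
import pandas as pd

dicts = [
    {'date0': '2019-01-01'},
    {'feuxdate': 0},
    {'notadate': 'hi'},
    {'date1': '20190201'}
]
df = pd.DataFrame.from_dict(dicts)

out = df.select_dtypes(include=object).apply(pd.to_datetime, errors='ignore').combine_first(df)
print(out.dtypes)
# date0 and date1 become datetime64[ns]; feuxdate stays float64; notadate stays object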
I've got a dataset with multiple missing sequences of varying lengths, and I'd like to find the first valid numbers that occur before and after these sequences for particular dates. In the sample dataset below, I would like to find the valid values of ColumnB that occur closest to the date 2018-11-26.
Datasample:
Date ColumnA ColumnB
2018-11-19 107.00 NaN
2018-11-20 104.00 NaN
2018-11-21 106.00 NaN
2018-11-22 105.24 80.00
2018-11-23 104.63 NaN
2018-11-26 104.62 NaN
2018-11-28 104.54 NaN
2018-11-29 103.91 86.88
2018-11-30 103.43 NaN
2018-12-01 106.13 NaN
2018-12-02 110.83 NaN
Expected output:
[80, 86.88]
Some details:
If this particular sequence were the only one with missing values, I would have been able to solve it with for loops, or with the pandas functions first_valid_index() or isnull() as described in Pandas - find first non-null value in column, but that will rarely be the case.
I'm able to solve this using a few for loops, but it's very slow for larger datasets and not very elegant, so I'd really like to hear other suggestions!
Try it this way: get the index of the date, then slice before and after it to find the first valid number on each side.
import numpy as np

idx = np.where(df['Date'] == '2018-11-26')[0][0]
# idx is 5 for the sample data above
num = (df.loc[df.loc[:idx, 'ColumnB'].first_valid_index(), 'ColumnB'],   # first valid value in the slice up to the date
       df.loc[df.loc[idx:, 'ColumnB'].first_valid_index(), 'ColumnB'])   # first valid value from the date onward
num
(80.0, 86.879999999999995)
I'd try it this way:
import pandas as pd
import numpy as np

# assumes the Date column has been set as the (datetime) index
df_vld = df.dropna()
idx = np.argmin(abs(df_vld.index - pd.Timestamp(2018, 11, 26)))
# 1
df_vld.loc[df_vld.index[idx]]
Out:
ColumnA 103.91
ColumnB 86.88
Name: 2018-11-29 00:00:00, dtype: float64
Assuming Date is set as the index:
[df['ColumnB'].ffill().loc['2018-11-26'], df['ColumnB'].bfill().loc['2018-11-26']]
You can use ffill and bfill to create two columns with the values from before and after, such as
df['before'] = df.ColumnB.ffill()
df['after'] = df.ColumnB.bfill()
then get the values for the dates you want with .loc:
print (df.loc[df.Date == pd.to_datetime('2018-11-26'),['before','after']].values[0].tolist())
[80.0, 86.88]
and if you have a list of dates then you can use isin:
list_dates = ['2018-11-26','2018-11-28']
print (df.loc[df.Date.isin(pd.to_datetime(list_dates)),['before','after']].values.tolist())
[[80.0, 86.88], [80.0, 86.88]]
Here's a way to do it:
t = '2018-11-26'
Look for the index of the date t:
ix = df.loc[df.Date==t].index.values[0]
Keep positions of non-null values in ColumnB:
non_nulls = np.where(~df.ColumnB.isnull())[0]
Get the nearest non-null values both above and below:
[df.loc[non_nulls[non_nulls < ix][-1],'ColumnB']] + [df.loc[non_nulls[non_nulls > ix][0],'ColumnB']]
[80.0, 86.88]
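For completeness, a self-contained sketch that rebuilds the sample frame from the question and applies the ffill/bfill idea from above (Date kept as a regular column):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Date': pd.to_datetime(['2018-11-19', '2018-11-20', '2018-11-21', '2018-11-22',
                            '2018-11-23', '2018-11-26', '2018-11-28', '2018-11-29',
                            '2018-11-30', '2018-12-01', '2018-12-02']),
    'ColumnA': [107.00, 104.00, 106.00, 105.24, 104.63, 104.62, 104.54, 103.91, 103.43, 106.13, 110.83],
    'ColumnB': [np.nan, np.nan, np.nan, 80.00, np.nan, np.nan, np.nan, 86.88, np.nan, np.nan, np.nan],
})

mask = df['Date'] == '2018-11-26'
before = df['ColumnB'].ffill()[mask].iloc[0]   # last valid value at or before that date
after = df['ColumnB'].bfill()[mask].iloc[0]    # next valid value at or after that date
print([before, after])                         # [80.0, 86.88]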
I'm using a Pandas DataFrame, and from it I was able to extract a Series named 'xy' that looks like this:
INVOICE
2014-08-14 00:00:00
4557
Printing
nan
Item AMOUNT
nan
1 9.6
0
0
0
9.6
2
11.6
nan
nan
nan
What I need to find is the maximum value, which is usually located towards the end of the 'xy' series. When I try to convert the series I run into problems because some of the values are strings rather than ints or floats. I need a robust way to do this, as I am writing this script for several different files.
Try pd.to_numeric with errors set to 'coerce', which turns non-numeric values into NaN:
pd.to_numeric(xy, 'coerce').max()
4557.0
To get the last valid numeric value instead:
s = pd.to_numeric(xy, 'coerce')
s.loc[s.last_valid_index()]
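A minimal sketch with a made-up series in the same spirit, showing both the maximum and the last numeric entry:
import pandas as pd
import numpy as np

xy = pd.Series(['INVOICE', '2014-08-14 00:00:00', 4557, 'Printing', np.nan,
                'Item AMOUNT', np.nan, 9.6, 0, 0, 0, 9.6, 2, 11.6, np.nan])

s = pd.to_numeric(xy, errors='coerce')   # strings and dates become NaN
print(s.max())                           # 4557.0
print(s.loc[s.last_valid_index()])       # 11.6, the last numeric entry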
I have a DataFrame that looks like as follows (the address key is the index):
address date1 date2 date3 date4 date5 date6 date7
<email> NaN NaN NaN 1 NaN NaN NaN
I want to calculate the mean across a row, but when I use DataFrame.mean(axis=1), I get NaN (in the above example, I want a mean of 1). I get NaN even when I use DataFrame.mean(axis=1, skipna=True, numeric_only=True). How can I get the correct mean for the rows in this DataFrame?
Despite appearances, your dtypes are not numeric, hence the NaN values; you need to cast the type using astype:
df['date4'] = df['date4'].astype(int)
Then it will work. Depending on how you loaded or created this data, this is something you should correct at that stage rather than as a post-processing step, if possible.
You can confirm what the dtypes are by looking at the output of df.info(), and you can filter out non-numeric columns using select_dtypes: df.select_dtypes(include=[np.number]) selects just the numeric columns.
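A minimal sketch of the idea, using pd.to_numeric via apply as a variant of astype that also tolerates the all-NaN columns:
import pandas as pd
import numpy as np

# made-up one-row frame where the only real value was loaded as a string
df = pd.DataFrame({'date1': [np.nan], 'date2': [np.nan], 'date3': [np.nan],
                   'date4': ['1'], 'date5': [np.nan], 'date6': [np.nan], 'date7': [np.nan]},
                  index=['<email>'])

print(df.mean(axis=1, numeric_only=True))        # NaN: the '1' is a string, so it is excluded
numeric = df.apply(pd.to_numeric, errors='coerce')
print(numeric.mean(axis=1))                      # 1.0 once every column is numeric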