I'm using a pandas DataFrame, and from it I was able to extract a Series named 'xy' that looks like this:
INVOICE
2014-08-14 00:00:00
4557
Printing
nan
Item AMOUNT
nan
1 9.6
0
0
0
9.6
2
11.6
nan
nan
nan
What I need to find is the maximum value, which is usually located towards the end of the 'xy' Series. When I try converting it to string I run into problems, since some of the entries are strings rather than int or float. I need a robust way to do this, as I am writing this script for several different files.
Try pd.to_numeric:
pd.to_numeric(xy, errors='coerce').max()
4557.0
To get the last float number:
s = pd.to_numeric(xy, errors='coerce')
s.loc[s.last_valid_index()]
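For reference, a minimal self-contained sketch of both steps; the series below is a made-up stand-in for 'xy', not your actual data:
import pandas as pd

xy = pd.Series(['INVOICE', '2014-08-14 00:00:00', '4557', 'Printing', None,
                'Item AMOUNT', 1, 9.6, 0, 0, 0, 9.6, 2, 11.6, None])

s = pd.to_numeric(xy, errors='coerce')   # non-numeric entries become NaN
print(s.max())                           # 4557.0 -> overall maximum
print(s.loc[s.last_valid_index()])       # 11.6  -> last numeric value in the series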
I have a list of floats, and when I try to convert it into a Series or DataFrame I run into an error. mylist looks like this:
code
000001.SZ 1.305442
000002.SZ 1.771655
000004.SZ 2.649862
000005.SZ 1.373074
000006.SZ 1.115238
...
601512.SH 16.305734
688123.SH 53.395579
603995.SH 19.598881
688268.SH 70.174454
002972.SZ 19.644900
300811.SZ 24.042762
688078.SH 86.263280
603109.SH NaN
Length: 3753, dtype: float64
df = pd.DataFrame(data=mylist, columns=["std_r_in20days"])
print(df)
s = pd.Series(mylist)
print(s)
The result is:
std_r_in20days
0 NaN
0 code
000001.SZ 1.305442
000002.SZ 1.77...
dtype: object
AttributeError: Can only use .str accessor with string values (i.e. inferred_type is 'string', 'unicode' or 'mixed')
Does this occur because there is a NaN in mylist? If so, how can I fix it? I don't want to delete the rows with NaN, just leave them there.
You can do it like this; pandas also has other functions such as fillna():
import numpy as np
import pandas as pd

Ser = pd.DataFrame([[1, 2, 3, 4, None], [2, 3, 4, 5, np.nan]])
Ser = Ser.replace(np.nan, 0)
Instead of removing the whole row, you can just replace those values with 0 using the pandas fillna() function:
df.fillna(0)
Also, it is good practice to perform a null check at the beginning of your script using:
df.isna().sum().sum()
This will give you the number of NaN values in your whole dataframe.
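A minimal sketch of that workflow, using a short made-up list in place of mylist:
import numpy as np
import pandas as pd

mylist = [1.305442, 1.771655, np.nan, 2.649862, np.nan]

s = pd.Series(mylist)                     # NaN entries are kept, dtype stays float64
df = pd.DataFrame({'std_r_in20days': s})  # one named column, NaN rows preserved

print(df.isna().sum().sum())              # 2 -> number of missing values in the frame
df = df.fillna(0)                         # replace NaN with 0 instead of dropping rows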
I have the following data set:
Survived Not Survived
0 NaN 22.0
1 38.0 NaN
2 26.0 NaN
3 35.0 NaN
4 NaN 35.0
.. ... ...
886 NaN 27.0
887 19.0 NaN
888 NaN NaN
889 26.0 NaN
890 NaN 32.0
I want to remove all the rows which contain NaN, so I wrote the following code (the dataset's name is titanic_feature_data):
titanic_feature_data = titanic_feature_data.dropna()
And when I try to display the new dataset I get the following result:
Empty DataFrame
Columns: [Survived, Not Survived]
Index: []
What's the problem, and how can I fix it?
By using titanic_feature_data.dropna(), you are removing all rows with at least one missing value. From the data you printed in your question, it looks like every row contains at least one missing value. If that is the case, it makes total sense that your dataframe is empty after dropna(), right?
Having said that, perhaps you are looking to drop rows that have a missing value for one particular column, for example column Not Survived. Then you could use:
titanic_feature_data.dropna(subset=['Not Survived'])
Also, if you are confused about why certain rows are dropped, I recommend checking for missing values explicitly first, without dropping them. That way you can see which instances would have been dropped:
incomplete_rows = titanic_feature_data.isnull().any(axis=1)
incomplete_rows is a boolean Series which indicates whether a row contains any missing value. You can use it to subset your dataframe and see which rows contain missing values (presumably all of them, given your example):
titanic_feature_data.loc[incomplete_rows, :]
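To see the difference on a tiny made-up frame (the values below are assumptions, not the Titanic data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Survived': [np.nan, 38.0, np.nan],
                   'Not Survived': [22.0, np.nan, np.nan]})

print(df.dropna())                          # empty: every row has at least one NaN
print(df.dropna(subset=['Not Survived']))   # keeps only rows where Not Survived is present

incomplete_rows = df.isnull().any(axis=1)   # boolean mask: rows with any missing value
print(df.loc[incomplete_rows, :])           # inspect them before deciding what to drop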
I have the following dataframe after I appended the data from different sources of files:
Owed Due Date
Input NaN 51.83 08012019
Net NaN 35.91 08012019
Output NaN -49.02 08012019
Total -1.26 38.72 08012019
Input NaN 58.43 09012019
Net NaN 9.15 09012019
Output NaN -57.08 09012019
Total -3.48 10.50 09012019
Input NaN 66.50 10012019
Net NaN 9.64 10012019
Output NaN -64.70 10012019
Total -5.16 11.44 10012019
I have been trying to figure out how to reorganize this dataframe into a multi-index layout, with one row per Date and the Input/Net/Output/Total labels spread across the columns.
I have tried to use melt and pivot, but with limited success in reshaping anything. I would appreciate some guidance!
P.S.: When using print(df), the date shows a two-digit day (e.g. 08). However, if I write this to a CSV file, it becomes 8 instead of 08 for single-digit days. I hope someone can guide me on this too, thanks.
Here you go:
df.set_index('Date', append=True).unstack(0).dropna(axis=1)
set_index() moves Date to become an additional index level. Then unstack(0) moves the original index labels out to become column names. Finally, drop the NaN columns and you have your desired result.
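A minimal sketch of that reshape on the sample data above (assuming Input/Net/Output/Total form the index and Date is stored as a string column):
import numpy as np
import pandas as pd

labels = ['Input', 'Net', 'Output', 'Total'] * 3
dates = ['08012019'] * 4 + ['09012019'] * 4 + ['10012019'] * 4
owed = [np.nan, np.nan, np.nan, -1.26,
        np.nan, np.nan, np.nan, -3.48,
        np.nan, np.nan, np.nan, -5.16]
due = [51.83, 35.91, -49.02, 38.72,
       58.43, 9.15, -57.08, 10.50,
       66.50, 9.64, -64.70, 11.44]
df = pd.DataFrame({'Owed': owed, 'Due': due, 'Date': dates}, index=labels)

out = df.set_index('Date', append=True).unstack(0).dropna(axis=1)
print(out)
# Columns are now a MultiIndex such as ('Due', 'Input'), ..., ('Owed', 'Total');
# the all-NaN Owed columns for Input/Net/Output are dropped, and Date is the index.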
I've got a dataset with multiple missing sequences of varying lengths where I'd like to find the first valid numbers that occur before and after these sequences for some particular dates. In the sample dataset below, I would like to find the valid numbers for ColumnB that occur closest to the date 2018-11-26.
Data sample:
Date ColumnA ColumnB
2018-11-19 107.00 NaN
2018-11-20 104.00 NaN
2018-11-21 106.00 NaN
2018-11-22 105.24 80.00
2018-11-23 104.63 NaN
2018-11-26 104.62 NaN
2018-11-28 104.54 NaN
2018-11-29 103.91 86.88
2018-11-30 103.43 NaN
2018-12-01 106.13 NaN
2018-12-02 110.83 NaN
Expected output:
[80, 86.88]
Some details:
If this particular sequence were the only one with missing values, I could have solved it with for loops, or with the pandas functions first_valid_index() or isnull() as described in Pandas - find first non-null value in column, but that will rarely be the case.
I'm able to solve this using a few for loops, but it's very slow for larger datasets and not very elegant, so I'd really like to hear other suggestions!
Try it this way: get the index, then slice to find the first valid number on either side.
idx = np.where(df['Date']=='2018-11-26')[0][0]
# idx 5
num = (df.loc[df.loc[:idx,'ColumnB'].first_valid_index(),'ColumnB'],
df.loc[df.loc[idx:,'ColumnB'].first_valid_index(),'ColumnB'])
num
(80.0, 86.879999999999995)
I'd try it this way:
import pandas as pd
import numpy as np
df_vld = df.dropna()
idx = np.argmin(abs(df_vld.index - pd.Timestamp(2018, 11, 26)))  # pd.datetime is deprecated; this assumes Date is the index
# 1
df_vld.loc[df_vld.index[idx]]
Out:
ColumnA 103.91
ColumnB 86.88
Name: 2018-11-29 00:00:00, dtype: float64
Or as a one-liner (this assumes Date is the DataFrame index):
[df['ColumnB'].ffill().loc['2018-11-26'], df['ColumnB'].bfill().loc['2018-11-26']]
You can use ffill and bfill to create two columns with the values from before and after, such as:
df['before'] = df.ColumnB.ffill()
df['after'] = df.ColumnB.bfill()
Then get the value for the dates you want with loc:
print (df.loc[df.Date == pd.to_datetime('2018-11-26'),['before','after']].values[0].tolist())
[80.0, 86.88]
And if you have a list of dates, you can use isin:
list_dates = ['2018-11-26','2018-11-28']
print (df.loc[df.Date.isin(pd.to_datetime(list_dates)),['before','after']].values.tolist())
[[80.0, 86.88], [80.0, 86.88]]
Here's a way to do it:
t = '2018-11-26'
Look for the index of the date t:
ix = df.loc[df.Date==t].index.values[0]
Keep positions of non-null values in ColumnB:
non_nulls = np.where(~df.ColumnB.isnull())[0]
Get the nearest non-null values, both above and below:
[df.loc[non_nulls[non_nulls < ix][-1],'ColumnB']] + [df.loc[non_nulls[non_nulls > ix][0],'ColumnB']]
[80.0, 86.88]
I have a DataFrame that looks as follows (the address key is the index):
address date1 date2 date3 date4 date5 date6 date7
<email> NaN NaN NaN 1 NaN NaN NaN
I want to calculate the mean across a row, but when I use DataFrame.mean(axis=1), I get NaN (in the above example, I want a mean of 1). I get NaN even when I use DataFrame.mean(axis=1, skipna=True, numeric_only=True). How can I get the correct mean for the rows in this DataFrame?
Despite appearances, your dtypes are not numeric, hence the NaN values; you need to cast the type using astype:
df['date4'] = df['date4'].astype(int)
Then it will work. Depending on how you loaded or created this data, this is something you should correct at that stage rather than as a post-processing step, if possible.
You can confirm what the dtypes are by looking at the output of df.info(), and you can also filter out non-numeric columns using select_dtypes: df.select_dtypes(include=[np.number]) selects just the numeric columns.
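As a minimal sketch (the values and dtypes below are assumptions chosen to reproduce the symptom): coerce every column to numeric, then take the row mean.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[np.nan, np.nan, np.nan, '1', np.nan, np.nan, np.nan]],
    index=['<email>'],
    columns=['date1', 'date2', 'date3', 'date4', 'date5', 'date6', 'date7'],
)
print(df.dtypes)                                      # date4 is object, not numeric
print(df.mean(axis=1, numeric_only=True))             # NaN: the only non-NaN column is skipped

numeric = df.apply(pd.to_numeric, errors='coerce')    # every column becomes float64
print(numeric.mean(axis=1, skipna=True))              # <email>    1.0
Using pd.to_numeric with errors='coerce' is a slightly more forgiving alternative to astype(int) when some entries cannot be parsed.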