DataFrame from dicts with automatic date parsing - python

I am creating a Pandas DataFrame from a sequence of dicts.
The dicts are large and somewhat heterogeneous.
Some of the fields are dates.
I would like to automatically detect and parse the date fields.
This can be achieved by writing to CSV and reading it back:
df0 = pd.DataFrame.from_dict(dicts)
df0.to_csv('tmp.csv', index=False)
df = pd.read_csv('tmp.csv', parse_dates=True)
I would like to find a more direct way to do this.

Use pd.to_datetime with errors='ignore'.
Only use it on columns of dtype object, selected with select_dtypes. This prevents converting numeric columns into nonsensical dates.
errors='ignore' abandons the conversion attempt for a column if any errors are encountered, leaving that column unchanged.
combine_first is used instead of update because update keeps the initial dtypes; since those were object, that would mess it all up.
df.select_dtypes(include=object).apply(pd.to_datetime, errors='ignore').combine_first(df)
date0 date1 feuxdate notadate
0 2019-01-01 NaT NaN NaN
1 NaT NaT 0.0 NaN
2 NaT NaT NaN hi
3 NaT 2019-02-01 NaN NaN
Could've also gotten tricky with it using assign to deal with dtypes
df.assign(**df.select_dtypes(include=object).apply(pd.to_datetime, errors='ignore'))
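Note that errors='ignore' has been deprecated for pd.to_datetime in newer pandas releases. A minimal per-column fallback that mimics the same behavior (a sketch, assuming the object-dtype columns above; try_parse_dates is a made-up helper name):
import pandas as pd

def try_parse_dates(df):
    # Try converting each object column; on failure keep the
    # original column, mimicking the old errors='ignore' behavior.
    out = df.copy()
    for col in df.select_dtypes(include=object).columns:
        try:
            out[col] = pd.to_datetime(df[col])
        except (ValueError, TypeError):
            pass  # not a date column; leave it untouched
    return out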
Setup
dicts = [
    {'date0': '2019-01-01'},
    {'feuxdate': 0},
    {'notadate': 'hi'},
    {'date1': '20190201'}
]
df = pd.DataFrame.from_dict(dicts)

Related

How to fill missing dates with corresponding NaN in other columns

I have a CSV that initially creates following dataframe:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-05 52304.0
Using the following script, I would like to fill in the missing dates and have a corresponding NaN value in the Portfoliovalue column. So the result would be this:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
I first tried the method here: Fill the missing date values in a Pandas Dataframe column
However, the bfill replaces all my NaNs, and removing it only returns an error.
So far I have tried this:
df = pd.read_csv("Tickers_test5.csv")
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
portfolio_value = portfolio_value + cash
date = datetime.date(datetime.now())
df2.loc[len(df2)] = [date, portfolio_value]
print(df2.asfreq('D'))
However, this only returns this:
Date Portfoliovalue
1970-01-01 NaN NaN
Thanks for your help. I am really impressed at how helpful this community is.
Quick update:
I have added the code so that it fills my missing dates. However, it is part of a programme that tries to update the missing dates every time it launches. So when I execute the code and no dates are missing, I get the following error:
ValueError: cannot reindex from a duplicate axis
The code is as follows:
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
date = datetime.date(datetime.now())
df2.loc[date, 'Portfoliovalue'] = portfolio_value
#Solution provided by Uts after asking on Stackoverflow
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.set_index('Date').asfreq('D').reset_index()
So by the looks of it the code adds a duplicate date, which then causes the .reindex() function to raise the ValueError. However, I am not sure how to proceed. Is there an alternative to .reindex() or maybe the assignment of today's date needs changing?
Pandas has an asfreq method for a DatetimeIndex; it is basically a thin but convenient wrapper around reindex() that generates a date_range and calls reindex().
Code
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date').asfreq('D').reset_index()
Output
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
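As for the follow-up ValueError: the duplicate-axis error suggests the same date got appended twice. A minimal sketch (assuming the Date/Portfoliovalue layout above) that drops duplicate dates before reindexing:
df2['Date'] = pd.to_datetime(df2['Date'])
# Keep only the latest row per date so the index is unique
df2 = df2.drop_duplicates(subset='Date', keep='last')
df2 = df2.set_index('Date').asfreq('D').reset_index()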
Pandas has a reindex method: given a list of indices, it keeps only the rows for those indices.
In your case, you can create all the dates you want with date_range, for example, and then pass that to reindex. You might need a simple set_index and reset_index, but I assume you don't care much about the original index.
Example:
df.set_index('Date').reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')).reset_index()
First we set the 'Date' column as the index. Then we use reindex with a full list of dates (given by date_range from the minimal date to the maximal date in 'Date', with daily frequency) as the new index. This results in NaNs in places without a former value.
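One small detail (an assumption about the defaults): the index produced by date_range has no name, so reset_index brings it back as a column called 'index'. Naming it with rename_axis keeps the 'Date' label:
out = (df.set_index('Date')
         .reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D'))
         .rename_axis('Date')  # name the new index so reset_index restores 'Date'
         .reset_index())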

Does this occur because there is a NaN?

I have a list of floats that looks like the following, and I try to convert it into a Series or DataFrame:
code
000001.SZ 1.305442
000002.SZ 1.771655
000004.SZ 2.649862
000005.SZ 1.373074
000006.SZ 1.115238
...
601512.SH 16.305734
688123.SH 53.395579
603995.SH 19.598881
688268.SH 70.174454
002972.SZ 19.644900
300811.SZ 24.042762
688078.SH 86.263280
603109.SH NaN
Length: 3753, dtype: float64
df = pd.DataFrame(data=mylist, columns=["std_r_in20days"])
print(df)
s = pd.Series(mylist)
print(s)
The result is:
std_r_in20days
0 NaN
0 code
000001.SZ 1.305442
000002.SZ 1.77...
dtype: object
AttributeError: Can only use .str accessor with string values (i.e. inferred_type is 'string', 'unicode' or 'mixed')
Does this occur because there is a NaN in mylist? If so, how can I fix it? I don't want to delete the rows with NaN, just leave them there.
Ser = pd.DataFrame([[1, 2, 3, 4, None], [2, 3, 4, 5, np.nan]])
Ser = Ser.replace(np.nan, 0)
You can do it like this. There are also other functions in pandas, like fillna(). :)
Instead of removing the whole row, you can just replace those values with 0 using the pandas function
df.fillna(0)
Also, it is good practice to perform a null check at the beginning of your script using
df.isna().sum().sum()
This will give you the number of NaN values in your whole DataFrame.
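One gotcha worth noting: fillna returns a new object by default, so assign the result back. A minimal self-contained sketch:
import numpy as np
import pandas as pd

s = pd.Series([1.305442, np.nan, 2.649862])
print(s.isna().sum())  # 1 -> number of NaN values
s = s.fillna(0)        # fillna returns a copy; assign it back
print(s)               # NaN replaced by 0.0, other values untouched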

Any way to flag bad lines in pandas when reading an excel file?

pandas.read_csv has (warn, error) bad-line options; I can't see any equivalent for pandas.read_excel. Is there a reason? For example, suppose I read an Excel file where a column is supposed to be a datetime and the pandas.read_excel function encounters an int or str in one or a few of the rows. Do I need to handle this myself?
In short, no, I do not believe there is a way to do this automatically with a parameter you pass to read_excel(). This is how to solve your problem, though:
Let's say that when you read in your dataframe it looks like this:
df = pd.read_excel('Desktop/Book1.xlsx')
df
Date
0 2020-09-13 00:00:00
1 2
2 abc
3 2020-09-14 00:00:00
You can then pass errors='coerce' to pd.to_datetime():
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df
Date
0 2020-09-13
1 NaT
2 NaT
3 2020-09-14
Finally, you can drop those rows with:
df = df[df['Date'].notnull()]
df
Date
0 2020-09-13
3 2020-09-14
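If the goal is to flag the bad lines rather than silently drop them, a minimal sketch using the same coerce idea (starting again from the raw frame, column name 'Date' assumed as above):
raw = df['Date']                       # keep the original values
df['Date'] = pd.to_datetime(raw, errors='coerce')
bad = df['Date'].isna() & raw.notna()  # rows that failed to parse
print(df.index[bad].tolist())          # report the offending row labels
df = df[~bad]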

How to calculate the mean of a pandas DataFrame with NaN values

I have a DataFrame that looks as follows (the address key is the index):
address date1 date2 date3 date4 date5 date6 date7
<email> NaN NaN NaN 1 NaN NaN NaN
I want to calculate the mean across a row, but when I use DataFrame.mean(axis=1), I get NaN (in the above example, I want a mean of 1). I get NaN even when I use DataFrame.mean(axis=1, skipna=True, numeric_only=True). How can I get the correct mean for the rows in this DataFrame?
Despite appearances, your dtypes are not numeric, hence the NaN values; you need to cast the type using astype:
df['date4'] = df['date4'].astype(int)
then it will work. Depending on how you loaded/created this data, this is something you should correct at that stage rather than as a post-processing step, if possible.
You can confirm what the dtypes are by looking at the output of df.info(), and you can also filter out non-numeric columns using select_dtypes: df.select_dtypes(include=[np.number]) selects just the numeric columns.
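If several columns may have ended up stored as strings, a hedged alternative is to coerce everything to numeric before averaging (a sketch, not a substitute for fixing the load step):
numeric = df.apply(pd.to_numeric, errors='coerce')  # non-numbers become NaN
row_means = numeric.mean(axis=1)                    # skipna=True by default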

pandas: how to write tz-aware datetime columns with some missing data to store

I have dataframes with columns of timezone-aware datetimes that I need to write to a store. However as soon as I have any missing values, writing fails. I set up a simple example:
pd.set_option('io.hdf.default_format','table')
mdstore = pd.HDFStore(storeFile, complevel=9, complib='blosc')
utcdate = pd.to_datetime('20000101', utc=True)
df1 = pd.DataFrame(columns=['UTCdatetime'], data=utcdate, index=pd.date_range('20140627', periods=2))
mdstore['/Test'] = df1
This works fine, as it has no missing values. If I introduce missing values, though, I get one of two different errors.
Error 1: If I then add another column with only one of the two rows populated, I get an error:
df1.loc['20140628','UTCdatetime2'] = utcdate
print(df1)
mdstore['/Test'] = df1
UTCdatetime UTCdatetime2
2014-06-27 2000-01-01 00:00:00+00:00 NaN
2014-06-28 2000-01-01 00:00:00+00:00 2000-01-01 00:00:00+00:00
Exception: cannot find the correct atom type -> [dtype->object,items->Index([u'UTCdatetime', u'UTCdatetime2'], dtype='object')] 'float' object has no attribute 'tzinfo'
Error 2: If instead I add a new row with NA populated I get a different error:
df1.loc[pd.to_datetime('20140629'),'UTCdatetime'] = pd.NaT
print(df1)
mdstore['/Test'] = df1
UTCdatetime
2014-06-27 2000-01-01 00:00:00+00:00
2014-06-28 2000-01-01 00:00:00+00:00
2014-06-29 NaN
TypeError: too many timezones in this block, create separate data columns
I'm hoping for a workaround that doesn't involve filling the NaNs with an arbitrary date. I tried fillna with pd.NaT, but that didn't do anything. Thanks!
PS: I also had an issue with combine_first and tz (pandas tzinfo lost by combine_first) - it doesn't appear to be related, but just in case.
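One workaround sometimes suggested, offered here as an untested sketch rather than a confirmed fix: store the column as timezone-naive UTC and re-attach the timezone after reading; NaT survives the round trip as a plain missing value:
# Strip the timezone before writing (values stay in UTC)
df1['UTCdatetime'] = df1['UTCdatetime'].dt.tz_convert('UTC').dt.tz_localize(None)
mdstore['/Test'] = df1

# Re-attach the timezone after reading
df_read = mdstore['/Test']
df_read['UTCdatetime'] = df_read['UTCdatetime'].dt.tz_localize('UTC')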
