All,
I have a dataframe (df_live) with the following structure:
live live.updated live.latitude live.longitude live.altitude live.direction live.speed_horizontal live.speed_vertical live.is_ground
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN
.. ... ... ... ... ... ... ... ... ...
95 NaN NaN NaN NaN NaN NaN NaN NaN NaN
96 NaN 2022-10-11T17:46:19+00:00 -45.35 169.88 5791.2 44.0 518.560 0.0 False
97 NaN 2022-10-11T17:45:54+00:00 -27.55 143.20 11277.6 139.0 853.772 0.0 False
98 NaN NaN NaN NaN NaN NaN NaN NaN NaN
99 NaN NaN NaN NaN NaN NaN NaN NaN NaN
I would like to iterate through this dataframe such that I only obtain rows for which numerical values are available (e.g. rows 96 and 97).
The code I am using is as follows:
import boto3
import json
from datetime import datetime
import calendar
import random
import time
import requests
import pandas as pd
aircraftdata = ''
params = {
    'access_key': 'KEY',
    'limit': '100',
    'flight_status': 'active'
}
url = "http://api.aviationstack.com/v1/flights"
api_result = requests.get(url, params=params)
api_statuscode = api_result.status_code
api_response = api_result.json()
df = pd.json_normalize(api_response["data"])
df_live = df[df.loc[:, df.columns.str.contains("live", case=False)].columns]
df_dep = df[df.loc[:, df.columns.str.contains("dep", case=False)].columns]
print(df_live)
for index, row in df_live.iterrows():
    if df_live["live_updated"] != "NaN":
        print(row)
    else:
        print("Not live")
This yields the following error
KeyError: 'live_updated'
Instead of iterating with the for loop, how about removing the rows that are all NaN in one go?
df_live = df_live[df_live.notna().any(axis=1)]
print(df_live)
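As a quick self-contained check of that one-liner, with toy values standing in for df_live:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_live: one row carries data, the others are all NaN.
df_live = pd.DataFrame({
    "live.updated": [np.nan, "2022-10-11T17:46:19+00:00", np.nan],
    "live.latitude": [np.nan, -45.35, np.nan],
})

# Keep only the rows where at least one column is non-null.
df_clean = df_live[df_live.notna().any(axis=1)]
```

Only the row with data survives; the original index labels are preserved.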
Be careful with the column names. The key error
KeyError: 'live_updated'
means that there is no column in the dataframe named 'live_updated'.
If you check your dataframe columns, the name you probably want is 'live.updated'. Note also that missing values are real NaNs, not the string "NaN", so test the row's value with pd.notna() instead of comparing the whole column to "NaN":
for index, row in df_live.iterrows():
    if pd.notna(row["live.updated"]):
        print(row)
    else:
        print("Not live")
Another solution could be to rename the dataframe columns before you refer to them:
df_live = df_live.rename(columns={'live.updated': 'live_updated'})
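For completeness, the row check can be sketched on a tiny made-up frame using the dotted column name; since missing entries are real NaNs rather than the string "NaN", pd.notna() does the test:

```python
import numpy as np
import pandas as pd

df_live = pd.DataFrame({"live.updated": [np.nan, "2022-10-11T17:46:19+00:00"]})

live_rows = []
for index, row in df_live.iterrows():
    # Check this row's value; missing entries are real NaNs, not the string "NaN".
    if pd.notna(row["live.updated"]):
        live_rows.append(index)
```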
Related
I have several csv files in a directory of folders and subfolders. All the csv files have headers and a time stamp as the first column, whether time-series data is present or not. I want to read all the csv files and return a status of empty if no data is present.
When I used df.empty to check, it returns False even though there is no data (the file has only the header row and the first column with time stamps).
The code I used is:
import pandas as pd
df1 = pd.read_csv("D://sirifort_with_data.csv", index_col=0)
df2 = pd.read_csv("D://sirifort_without_data.csv", index_col=0)
print(df1.empty)
print(df2.empty)
print(df2)
The result is:
False
False
PM2.5(ug/m3) PM10(ug/m3) ... NOx(ppb) NH3(ug/m3)
Time_Stamp ...
26/02/2022 0:00 NaN NaN ... NaN NaN
26/02/2022 0:15 NaN NaN ... NaN NaN
26/02/2022 0:30 NaN NaN ... NaN NaN
26/02/2022 0:45 NaN NaN ... NaN NaN
26/02/2022 1:00 NaN NaN ... NaN NaN
26/02/2022 1:15 NaN NaN ... NaN NaN
26/02/2022 1:30 NaN NaN ... NaN NaN
26/02/2022 1:45 NaN NaN ... NaN NaN
26/02/2022 2:00 NaN NaN ... NaN NaN
26/02/2022 2:15 NaN NaN ... NaN NaN
26/02/2022 2:30 NaN NaN ... NaN NaN
26/02/2022 2:45 NaN NaN ... NaN NaN
[12 rows x 6 columns]
Use the sum of one of the columns; for an all-NaN column it is zero (note this misfires if a column of real data happens to sum to zero):
def col_check(col):
    if df[col].sum() != 0:
        return 1

for col in df.columns:
    if col_check(col):
        print('not empty')
        break
The documentation clearly indicates:
If we only have NaNs in our DataFrame, it is not considered empty! We
will need to drop the NaNs to make the DataFrame empty:
df = pd.DataFrame({'A' : [np.nan]})
df.empty
False
and then suggests:
df.dropna().empty
True
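Applied to the question's situation, that check can be sketched on a tiny stand-in frame (made-up column name and timestamps):

```python
import numpy as np
import pandas as pd

# Mirrors the header-plus-timestamps CSV: an index exists, but every value is NaN.
df2 = pd.DataFrame(
    {"PM2.5(ug/m3)": [np.nan, np.nan]},
    index=["26/02/2022 0:00", "26/02/2022 0:15"],
)

print(df2.empty)                    # False -- the index alone makes it non-empty
print(df2.dropna(how="all").empty)  # True  -- no actual data survives
```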
I am working with multiple big dataframes. I want to remove their NaN parts automatically to ease the data cleansing process. Data is collected from a camera or radar feed, but the part I need is when a specific object comes into the view horizon of the camera/radar. So the data file (frame) looks like the below, and has lots of NaN values:
total in seconds datetime(utc) channels AlviraPotentialDronePlots_timestamp AlviraPotentialDronPlot_id ...
0 1601381457 2020-09-29 12:10:57 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1601381459 2020-09-29 12:10:59 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1601381460 2020-09-29 12:11:00 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1601381461 2020-09-29 12:11:01 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1601381463 2020-09-29 12:11:03 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... Useful data is here ... ... ... ... ... ... ... ... ...
623 1601382249 2020-09-29 12:24:09 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
624 1601382250 2020-09-29 12:24:10 NaN NaN NaN NaN NaN NaN NaN NaN ... 51.521264 5.858627 5.0 NaN NaN SearchRadar 0.0 0.0 NaN NaN
625 1601382251 2020-09-29 12:24:11 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I have removed the columns with all NaN values using:
df = df.dropna(axis=1, how='all')
Now, I want to remove rows that contain all NaN. However, since total in seconds and datetime(utc) are always present in the file, I cannot use the following command:
df = df.dropna(axis=0, how='all')
Also, I cannot use how='any', because that would remove parts of the useful data too (the useful data contains some NaN values which I will fill later). I have to use the dropna() in a way that it does not take the total in seconds and datetime(utc) into account, but if all other fields are NaNs, then removes the whole row.
The closest I came to solving this was the command mentioned in this link, but I guess I am not familiar enough with Python to formulate the following logic:
for each row: if every field other than 'total in seconds' and 'datetime(utc)' is NaN, then remove the row
I tried writing this with a for loop too, but was not successful. Can someone help me with this?
Thanks in advance.
You can check all columns except total in seconds and datetime(utc) by passing the subset parameter, built with Index.difference:
cols = ['total in seconds','datetime(utc)']
checked = df.columns.difference(cols)
df = df.dropna(subset=checked, how='all')
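A self-contained sketch of that approach, with made-up values mimicking the question's frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "total in seconds": [1601381457, 1601381459, 1601382250],
    "datetime(utc)": ["2020-09-29 12:10:57", "2020-09-29 12:10:59", "2020-09-29 12:24:10"],
    "lat": [np.nan, np.nan, 51.521264],
    "lon": [np.nan, np.nan, 5.858627],
})

cols = ["total in seconds", "datetime(utc)"]
checked = df.columns.difference(cols)   # every column except the two always-present ones
out = df.dropna(subset=checked, how="all")
```

Only the row that has data outside the two always-present columns is kept.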
If your number of columns is constant, you can use the thresh parameter, which is the minimum number of non-NaN values a row needs in order to be kept.
Say you have 50 columns and 2 of them are never empty: thresh=3 drops exactly the rows where everything outside those 2 columns is NaN.
For more, check https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
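For reference, thresh counts the non-NaN values a row must have to survive; a small sketch with two always-present columns and made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "total in seconds": [1, 2, 3],
    "datetime(utc)": ["t1", "t2", "t3"],
    "a": [np.nan, 5.0, np.nan],
    "b": [np.nan, np.nan, np.nan],
})

# With 2 always-present columns, thresh=3 keeps only rows that have
# at least one non-NaN value somewhere else.
out = df.dropna(thresh=3)
```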
Low S0.0 S1.0 S2.0 S3.0 S4.0 S5.0 S6.0 S7.0 S8.0 S9.0 S10.0 S11.0
0 55 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 60 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 78 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 77 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I have the following code to check whether any of the "S" columns is close to Low:
level=0.035
cond = np.isclose(df.Low, df['S0.0'], rtol=level) | np.isclose(df.Low, df['S1.0'], rtol=level) | ...
df['ST'] = np.where(cond, 100, 0)
But this looks too manual. Is there some way to cover all the S columns without naming each of them explicitly? Also, these columns keep changing, so naming every column sometimes gives an error. Thanks!
I think a solution can be as follows:
import numpy as np
from itertools import repeat
from operator import or_

selected_columns = [c for c in df.columns if c.startswith('S')]

cond = None
for low_serie, sel_serie in zip(repeat(df.Low), [df[c] for c in selected_columns]):
    if cond is None:
        cond = np.isclose(low_serie, sel_serie, rtol=level)
        continue
    cond = or_(cond, np.isclose(low_serie, sel_serie, rtol=level))
You have to pay attention to the condition used to select the column names; I put c.startswith('S') as an example.
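An alternative that avoids the explicit loop entirely is to broadcast Low against all S columns at once with NumPy (a sketch on made-up values; NaNs never compare close, so all-NaN rows come out as 0):

```python
import numpy as np
import pandas as pd

level = 0.035
df = pd.DataFrame({
    "Low": [55.0, 60.0, 12.0],
    "S0.0": [np.nan, 60.5, np.nan],
    "S1.0": [54.9, np.nan, np.nan],
})

s_cols = [c for c in df.columns if c.startswith("S")]
# Compare Low against every S column in one shot: (n, 1) vs (n, k) broadcast.
cond = np.isclose(df["Low"].to_numpy()[:, None],
                  df[s_cols].to_numpy(), rtol=level).any(axis=1)
df["ST"] = np.where(cond, 100, 0)
```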
I forward fill values in the following df using:
df = (df.resample('d') # ensure data is daily time series
.ffill()
.sort_index(ascending=True))
df before forward fill
id a b c d
datadate
1980-01-31 NaN NaN NaN NaN
1980-02-29 NaN 2 NaN NaN
1980-03-31 NaN NaN NaN NaN
1980-04-30 1 NaN 3 4
1980-05-31 NaN NaN NaN NaN
... ... ... ...
2019-08-31 NaN NaN NaN NaN
2019-09-30 NaN NaN NaN NaN
2019-10-31 NaN NaN NaN NaN
2019-11-30 NaN NaN NaN NaN
2019-12-31 NaN NaN 20 33
However, I wish to only forward fill one year after (date is datetime) the last observation and then the remaining rows simply be NaN. I am not sure what is the best way to introduce this criteria in this task. Any help would be super!
Thanks
If I understand you correctly, you want to forward-fill the values on Dec 31, 2019 to the next year. Try this:
end_date = df.index.max()
new_end_date = end_date + pd.offsets.DateOffset(years=1)
new_index = df.index.append(pd.date_range(end_date, new_end_date, inclusive='right'))  # 'closed' was renamed to 'inclusive' in pandas 1.4
df = df.reindex(new_index)
df.loc[end_date:, :] = df.loc[end_date:, :].ffill()
Result:
a b c d
1980-01-31 NaN NaN NaN NaN
1980-02-29 NaN 2.0 NaN NaN
1980-03-31 NaN NaN NaN NaN
1980-04-30 1.0 NaN 3.0 4.0
1980-05-31 NaN NaN NaN NaN
2019-08-31 NaN NaN NaN NaN
2019-09-30 NaN NaN NaN NaN
2019-10-31 NaN NaN NaN NaN
2019-11-30 NaN NaN NaN NaN
2019-12-31 NaN NaN 20.0 33.0
2020-01-01 NaN NaN 20.0 33.0
2020-01-02 NaN NaN 20.0 33.0
...
2020-12-31 NaN NaN 20.0 33.0
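Those steps can be exercised end-to-end on a tiny stand-in frame (made-up values; pandas 1.4 renamed date_range's closed parameter to inclusive):

```python
import numpy as np
import pandas as pd

# Month-end frame whose last observation sits on 2019-12-31.
df = pd.DataFrame({"c": [np.nan, 20.0]},
                  index=pd.to_datetime(["2019-11-30", "2019-12-31"]))

end_date = df.index.max()
new_end_date = end_date + pd.offsets.DateOffset(years=1)
# Extend the index by one year of daily dates, excluding end_date itself.
new_index = df.index.append(pd.date_range(end_date, new_end_date, inclusive="right"))
df = df.reindex(new_index)
# Forward-fill only from the last observation onward.
df.loc[end_date:, :] = df.loc[end_date:, :].ffill()
```

2020 is a leap year, so the appended range has 366 days; earlier NaNs are left untouched.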
One solution is to forward-fill using the limit parameter, but this won't handle the leap year:
df.fillna(method='ffill', limit=365)
The second solution is to define a more robust function that forward-fills within a 1-year window:
from pandas.tseries.offsets import DateOffset

def fun(serie_df):
    serie = serie_df.copy()
    indexes = serie[~serie.isnull()].index
    for idx in indexes:
        mask = (serie.index >= idx) & (serie.index < idx + DateOffset(years=1))
        serie.loc[mask] = serie[mask].fillna(method='ffill')
    return serie

df_filled = df.apply(fun, axis=0)
If a column has multiple non-NaN values in the same 1-year window, the first solution's fill stops once the next value is encountered, while the second solution treats consecutive values as if they were independent.
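The windowed approach can be sketched and checked on a tiny series (made-up dates and value; .ffill() is the modern spelling of fillna(method='ffill')):

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import DateOffset

def fill_one_year(serie):
    """Forward-fill each non-null value only within the year that follows it."""
    out = serie.copy()
    for idx in out[out.notna()].index:
        mask = (out.index >= idx) & (out.index < idx + DateOffset(years=1))
        out.loc[mask] = out[mask].ffill()
    return out

s = pd.Series([20.0, np.nan, np.nan],
              index=pd.to_datetime(["2019-12-31", "2020-06-30", "2021-06-30"]))
filled = fill_one_year(s)
```

The 2020-06-30 entry falls inside the one-year window of the 2019-12-31 observation and is filled; the 2021-06-30 entry falls outside and stays NaN.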
I have got the following dataframe, in which each column contains a set of values and each index is used only once. However, I would like a completely filled dataframe. To get that, I need to select X values from each column, where X is the number of non-NaN values in the column with the fewest (in this case column '1.0').
>>> stat_df_iws
iws_w -2.0 -1.0 0.0 1.0
0 0.363567 NaN NaN NaN
1 0.183698 NaN NaN NaN
2 NaN -0.337931 NaN NaN
3 -0.231770 NaN NaN NaN
4 NaN 0.544836 NaN NaN
5 NaN -0.377620 NaN NaN
6 NaN NaN -0.428396 NaN
7 NaN NaN -0.443317 NaN
8 NaN -0.268033 NaN NaN
9 NaN 0.246714 NaN NaN
10 NaN NaN -0.503887 NaN
11 NaN NaN NaN -0.298935
12 NaN -0.252775 NaN NaN
13 NaN -0.447757 NaN NaN
14 -0.650598 NaN NaN NaN
15 -0.660542 NaN NaN NaN
16 NaN -0.952041 NaN NaN
17 -0.667356 NaN NaN NaN
18 -0.920873 NaN NaN NaN
19 NaN -0.537657 NaN NaN
20 NaN NaN -0.525121 NaN
21 NaN NaN NaN -0.619755
22 NaN -0.652138 NaN NaN
23 NaN -0.924181 NaN NaN
24 NaN -0.665720 NaN NaN
25 NaN NaN -0.336841 NaN
26 -0.428931 NaN NaN NaN
27 NaN -0.348248 NaN NaN
28 NaN 0.781024 NaN NaN
29 0.110727 NaN NaN NaN
... ... ... ... ...
I've achieved this with the following code, but it is not a very pythonic way of solving it.
def get_non_null_from_pivot(df):
    lngth = min(len(col.dropna()) for ind, col in df.items())
    df = pd.concat([df.loc[:, -2.0].dropna().head(lngth).reset_index(drop=True),
                    df.loc[:, -1.0].dropna().head(lngth).reset_index(drop=True),
                    df.loc[:, 0.0].dropna().head(lngth).reset_index(drop=True),
                    df.loc[:, 1.0].dropna().head(lngth).reset_index(drop=True)],
                   axis=1)
    return df
Is there a simpler way to achieve the same goal, so that I can more automatically repeat this step for other dataframes? Preferably without for-loops, for efficiency reasons.
I've made the function a little shorter by looping through the columns, and it seems to work perfectly.
def get_non_null_from_pivot_short(df):
    lngth = min(len(col.dropna()) for ind, col in df.items())
    df = pd.concat([df.loc[:, col].dropna().head(lngth).reset_index(drop=True) for col in df],
                   axis=1)
    return df
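The same selection can be written without naming any column at all, via a single apply (a sketch on a small stand-in frame using a few of the question's values):

```python
import numpy as np
import pandas as pd

# Columns hold different numbers of non-NaN values, as in stat_df_iws.
df = pd.DataFrame({
    -2.0: [0.363567, 0.183698, np.nan, -0.231770],
    -1.0: [np.nan, -0.337931, 0.544836, np.nan],
    0.0: [-0.428396, np.nan, np.nan, np.nan],
})

lngth = int(df.notna().sum().min())  # non-NaN count of the shortest column
# For each column: drop NaNs, keep the first lngth values, realign the index.
out = df.apply(lambda col: col.dropna().head(lngth).reset_index(drop=True))
```

Here the shortest column has a single value, so the result is one fully filled row.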