How to deal with missing values in a Pandas DataFrame from open data? - python

I have downloaded ten open datasets of air pollution for 2010-2019 (loaded into Pandas DataFrames with 'read_csv') that contain some missing values.
The rows are ordered by day, with several items per day (like PM2.5, SO2, ...). Most days include 17 or 18 items. There are 27 columns, which are Year, Station, Item, and the hourly columns 00, 01, ..., 23.
In this case, I already used
df.fillna(np.nan).apply(lambda x: pd.to_numeric(x, errors='coerce'))
and df.interpolate(axis=1, inplace=True)
But if a row is missing values from '00' onward, the interpolate function does not work. To fill those blanks, I would need to merge in the last non-null day's data and interpolate again.
However, different days have different numbers of items, which means some rows still can't be filled.
In a nutshell, I'm now trying to concat all the data by the key of items and then use interpolate.
By the way, after data cleaning, I would like to apply xgboost and linear regression to predict PM2.5. Is there any recommended way to deal with the data?
(Or any demo code online?)
For example, the data would be like:
[screenshot: one of the datasets]
I used df.groupby('date').size() and got
[screenshot: size of different days]

Or in other words, how do I split the different days and concat them back together?
groupby(['date', 'items'])? And then how do I merge?
Or, is it possible to interpolate from the last value of the last row?
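One way to approach the "concat by item" idea is a rough sketch like the following, assuming each row has a date, a station, an item and hourly columns '00' to '23' (the exact column names here are assumptions): melt the hourly columns into one long series per (station, item) pair, sort by timestamp, and interpolate within each group so that gaps at the start of a day are filled from the previous day.

import pandas as pd

# Assumed layout: one row per (date, station, item) with hourly columns '00'..'23'.
hours = ['%02d' % h for h in range(24)]

# Melt the 24 hourly columns into a single long 'value' column.
long_df = df.melt(id_vars=['date', 'station', 'item'],
                  value_vars=hours,
                  var_name='hour', value_name='value')

# Build a proper timestamp so rows from different days line up in order.
long_df['timestamp'] = (pd.to_datetime(long_df['date'])
                        + pd.to_timedelta(long_df['hour'].astype(int), unit='h'))
long_df = long_df.sort_values(['station', 'item', 'timestamp'])

# Interpolate within each (station, item) series, so a gap at '00' can be
# filled from the previous day's values instead of staying NaN.
long_df['value'] = pd.to_numeric(long_df['value'], errors='coerce')
long_df['value'] = (long_df.groupby(['station', 'item'])['value']
                    .transform(lambda s: s.interpolate(limit_direction='both')))

After this, the long table can be pivoted back to the original wide shape, or fed directly into a model.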

Related

Sort the DataFrame's columns which are dynamically generated

I have a dataframe which is similar to this
d1 = pd.DataFrame({'name': ['xyz', 'abc', 'dfg'],
                   'age': [15, 34, 22],
                   'sex': ['s1', 's2', 's3'],
                   'w-1(6)': [96, 66, 74],
                   'w-2(5)': [55, 86, 99],
                   'w-3(4)': [11, 66, 44]})
Note that in my original DataFrame the week columns are generated dynamically, i.e. the columns
w-1(6), w-2(5) and w-3(4) are generated dynamically and change every week. I want to sort all three week columns in descending order of their values.
But the names of the columns cannot be used as they change every week.
Is there any possible way to achieve this?
Edit: The numbers might not always be present for all three weeks; if W-1 has no data, I won't have that column in the dataset at all. So that would mean only two week columns and not three.
You can use the column indices.
d1.sort_values(by=[d1.columns[3], d1.columns[4], d1.columns[5]], ascending=False)
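Since the edit notes that the number of week columns can vary, a small variation (assuming the first three columns are always name, age and sex) sorts by whatever week columns happen to exist:

# Assumes the first three columns are the fixed ones (name, age, sex);
# everything from index 3 onward is a dynamically generated week column.
week_cols = list(d1.columns[3:])
d1_sorted = d1.sort_values(by=week_cols, ascending=False)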

Pandas Python Dataframe

I have a dataset with YYYY-MM as the date, and I want to find the mean temperature for each year, so I need to add up the 12 months in a year and summarize. How do I do that using Pandas?
An example of my data: (I have more than one year of data; I tried to reshape it, but it doesn't seem to work.)
Let us do a string slice, then groupby + sum
s = df.groupby(df['month'].str[:4]).sum()
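Since the goal is the yearly mean, a variant of the same idea (assuming a 'month' column formatted as 'YYYY-MM' and a 'temperature' column; both names are assumptions) would be:

# Group by the year part of the 'YYYY-MM' string and average the temperatures.
yearly_mean = df.groupby(df['month'].str[:4])['temperature'].mean()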

Create equidistant data frame with time ranged data with Python

I have a .csv file in which data is stored as date ranges, with from and to date columns. However, I would like to create a daily data frame in Python out of it.
The time can be ignored, as a gas day always starts at 6am and ends at 6am.
My idea was to end up with a data frame index of dates (like from March 1st, 2019, ranging to December 31st, 2019) at a daily granularity.
I would create columns from the unique values of the identifier and place the respective values (or NaN) in them.
The latter I can easily do with pd.pivot_table, but my problem with the time range still exists...
Any ideas of how to cope with that?
[screenshot: time-ranged data frame]
It should look like this, just with rows at a daily granularity, taking the 'to' column into account as well. Maybe with range?
[screenshot: output should look similar to this, just with a different period]
You can use pandas and groupby on the column you want:
df = pd.read_csv("yourfile.csv")
groups = df.groupby("periodFrom")
groups.get_group("2019-03-09 06:00")
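If the aim is the daily-granularity frame itself, a sketch along these lines might help, assuming columns named 'periodFrom', 'periodTo', 'identifier' and 'value' (all of these names are assumptions about the CSV layout):

import pandas as pd

df = pd.read_csv("yourfile.csv", parse_dates=["periodFrom", "periodTo"])

# Expand every from/to range into one row per gas day (times ignored).
df["date"] = [list(pd.date_range(start.normalize(), end.normalize(), freq="D"))
              for start, end in zip(df["periodFrom"], df["periodTo"])]
daily = df.explode("date")

# Pivot so each identifier becomes a column on a daily index; missing days stay NaN.
daily_table = daily.pivot_table(index="date", columns="identifier",
                                values="value", aggfunc="mean")
daily_table = daily_table.reindex(pd.date_range("2019-03-01", "2019-12-31", freq="D"))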

Conditional Average from Pandas DataFrame

I have a dataframe with multiple columns of real estate sales data. I would like to find the average price-per-square-foot 'ppsf' for all 1bed-1bath sales by zip code. Here is my attempt (each key in the dict is a zip code):
bed1_bath1 = {}
for zip in zip_codes:
    bed1_bath1[zip] = df.loc[(df['bed'] == 1) & (df['bath'] == 1) & (df['zip'] == zip)].mean()
The problem is that this adds the mean of all columns from the dataframe to the dictionary. I'm sure there is a better way to do this; maybe using numpy.where?
(df[(df['bed']==1) & (df['bath']==1) & (df['zip']==zip)])['ppsf'].mean() would do it. You simply choose the column you are interested in before calculating the mean (so you will not even do the processing for the rest of the columns).
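A compact alternative with the same assumed columns avoids the loop entirely: filter for 1 bed / 1 bath once, then group by zip and take the mean of 'ppsf':

# Filter once, group by zip code, and average only the 'ppsf' column.
bed1_bath1 = (df[(df['bed'] == 1) & (df['bath'] == 1)]
              .groupby('zip')['ppsf'].mean()
              .to_dict())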

How to calculate based on multiple conditions using Python data frames?

I have excel data file with thousands of rows and columns.
I am using python and have started using pandas dataframes to analyze data.
What I want to do in column D is to calculate annual change for values in column C for each year for each ID.
I can use Excel to do this: if the org ID is the same as that in the prior row, calculate the annual change (leaving the cells highlighted in blue blank, because that's the first period for that particular ID). I don't know how to do this using Python. Can anyone help?
Assuming the dataframe is already sorted
df.groupby('ID').Cash.pct_change()
However, you can speed things up under the assumption that things are sorted, because it's not necessary to group in order to calculate the percentage change from one row to the next:
df.Cash.pct_change().mask(
    df.ID != df.ID.shift()
)
These should produce the column values you are looking for. In order to add the column, you'll need to assign to a column or create a new dataframe with the new column:
df['AnnChange'] = df.groupby('ID').Cash.pct_change()
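A tiny worked example (with made-up 'ID', 'Year' and 'Cash' values) shows the behaviour: the first row of each ID gets NaN, matching the cells left blank and highlighted in blue:

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2],
                   'Year': [2016, 2017, 2018, 2017, 2018],
                   'Cash': [100.0, 110.0, 99.0, 50.0, 75.0]})
df['AnnChange'] = df.groupby('ID').Cash.pct_change()
# ID 1: NaN, 0.10, -0.10;  ID 2: NaN, 0.50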
