I have a data set with the following null value counts:
df.isnull().sum()
country 0
country_long 0
name 0
gppd_idnr 0
capacity_mw 0
latitude 46
longitude 46
primary_fuel 0
other_fuel1 0
other_fuel2 0
other_fuel3 908
commissioning_year 380
owner 0
source 0
url 0
geolocation_source 0
wepp_id 908
year_of_capacity_data 388
generation_gwh_2013 524
generation_gwh_2014 507
generation_gwh_2015 483
generation_gwh_2016 471
generation_gwh_2017 465
generation_data_source 0
estimated_generation_gwh 908
I tried filling with the mean, mode, max, min and std, but the null values are not being removed. For example, when I try
df['wepp_id'] = df['wepp_id'].replace(np.nan, df['wepp_id'].mean())
it does not work, and the same thing happens with the median, std, min and max.
Try:
df['wepp_id'] = df['wepp_id'].fillna(df['wepp_id'].mean())
If this does not work, it means that your column is not of a numeric type. If it is a string type, then you need to do this first:
df['wepp_id'] = df['wepp_id'].astype(float)
Then run the fillna line shown above.
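For illustration, here is a minimal, self-contained sketch of that sequence; the tiny DataFrame is made up just for the demo, and only the column name wepp_id comes from the question:
import numpy as np
import pandas as pd

# Toy stand-in for the real column: numbers stored as strings, with missing entries.
df = pd.DataFrame({'wepp_id': ['1023', '1077', np.nan, '1105', np.nan]})

# Cast to float first, as described above; the NaN entries are preserved.
df['wepp_id'] = df['wepp_id'].astype(float)

# Then fill the NaN values with the column mean.
df['wepp_id'] = df['wepp_id'].fillna(df['wepp_id'].mean())

print(df['wepp_id'].isnull().sum())   # prints 0
If the column contains values that cannot be parsed as numbers, astype(float) will raise a ValueError; pd.to_numeric(df['wepp_id'], errors='coerce') is a more forgiving alternative that turns such values into NaN.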
I have a DataFrame (df), shown below, where each column is sorted from largest to smallest for frequency analysis. That leaves some values as either zeros or NaN, because each column has a different length.
08FB006 08FC001 08FC003 08FC005 08GD004
----------------------------------------------
0 253 872 256 11.80 2660
1 250 850 255 10.60 2510
2 246 850 241 10.30 2130
3 241 827 235 9.32 1970
4 241 821 229 9.17 1900
5 232 0 228 8.93 1840
6 231 0 225 8.05 1710
7 0 0 225 0 1610
8 0 0 224 0 1590
9 0 0 0 0 1590
10 0 0 0 0 1550
I need to perform the following calculation as if each column has a different length or number of records (ignoring zero values). I have tried using NaN, but for some reason operations on NaN values are not possible.
Here is what I am trying to do with my df columns:
shape_list1=[]
location_list1=[]
scale_list1=[]
for column in df.columns:
    shape1, location1, scale1 = stats.genpareto.fit(df[column])
    shape_list1.append(shape1)
    location_list1.append(location1)
    scale_list1.append(scale1)
Assuming all values are positive (as it seems from your example and description), try:
stats.genpareto.fit(df[df[column] > 0][column])
This filters every column to operate just on the positive values.
Or, if negative values are allowed,
stats.genpareto.fit(df[df[column] != 0][column])
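Put together with the loop from the question, the first suggestion might look like the sketch below (only the zero filter is new; df is assumed to be the DataFrame shown above):
from scipy import stats

shape_list1 = []
location_list1 = []
scale_list1 = []
for column in df.columns:
    # Drop the zero (padding) values of this column before fitting.
    data = df[df[column] > 0][column]
    shape1, location1, scale1 = stats.genpareto.fit(data)
    shape_list1.append(shape1)
    location_list1.append(location1)
    scale_list1.append(scale1)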
The syntax is messy, but change
shape1, location1, scale1=stats.genpareto.fit(df[column])
to
shape1, location1, scale1=stats.genpareto.fit(df[column][df[column].nonzero()[0]])
Explanation: df[column].nonzero() returns a tuple of size (1,); its only element, [0], is a numpy array that holds the index labels at which df[column] is nonzero. To index df[column] by these nonzero labels, you can use df[column][df[column].nonzero()[0]].
I would like to replace the 0 with the string from the same column in the previous row. E.g. the 0 under Sheffield should read Sheffield. I am working with pandas.
file = file[['Branch', 'Type' ,'total']]
#replace NaN with 0
file.fillna(0).tail(6)
Out[48]:
Branch Type total
394 Sheffield Sum of Resend to Branch 0
395 0 Number of PV Enquiries 83
396 Wakefield Sum of Resend to Branch 0
397 0 Number of PV Enquiries 38
398 York Sum of Resend to Branch 1
399 0 Number of PV Enquiries 59
I have tried:
a) # create a series for that column and replace
branch = file.iloc[:, 0]
branch.replace(0, branch(-1))
# why is this series not callable?
b) # I tried a loop in the dataframe
for item in file:
    if "Branch" == 0:
        replace(0, "Branch"[-1])
        # I am unsure how to refer to the row above
Use replace with method='ffill':
file_df['Branch'].replace(to_replace='0', method='ffill', inplace=True)
>>> file_df
Branch Type total
394 Sheffield Sum of Resend to Branch 0
395 Sheffield Number of PV Enquiries 83
396 Wakefield Sum of Resend to Branch 0
397 Wakefield Number of PV Enquiries 38
398 York Sum of Resend to Branch 1
399 York Number of PV Enquiries 59
Or, since it looks like you already replaced the NaN with 0, you could omit that step and just use ffill. i.e. if your original dataframe looks like:
>>> file_df
Branch Type total
394 Sheffield Sum of Resend to Branch 0
395 NaN Number of PV Enquiries 83
396 Wakefield Sum of Resend to Branch 0
397 NaN Number of PV Enquiries 38
398 York Sum of Resend to Branch 1
399 NaN Number of PV Enquiries 59
use:
file_df['Branch'].ffill(inplace=True)
Note that I called your dataframe file_df rather than file so as not to shadow the Python built-in.
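As a quick, self-contained check of the second approach, the tail of the frame from the question can be rebuilt and forward-filled like this (the values are copied from the example above):
import numpy as np
import pandas as pd

file_df = pd.DataFrame({
    'Branch': ['Sheffield', np.nan, 'Wakefield', np.nan, 'York', np.nan],
    'Type': ['Sum of Resend to Branch', 'Number of PV Enquiries'] * 3,
    'total': [0, 83, 0, 38, 1, 59],
}, index=range(394, 400))

# Forward-fill the missing branch names from the row above.
file_df['Branch'] = file_df['Branch'].ffill()
print(file_df)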
I have a dataframe in which under the column "component_id", I have component_ids repeating several times.
Here is what the df looks like:
In [82]: df.head()
Out[82]:
index molregno chembl_id assay_id tid tid component_id
0 0 942606 CHEMBL1518722 688422 103668 103668 4891
1 0 942606 CHEMBL1518722 688422 103668 103668 4891
2 0 942606 CHEMBL1518722 688721 78 78 286
3 0 942606 CHEMBL1518722 688721 78 78 286
4 0 942606 CHEMBL1518722 688779 103657 103657 5140
component_synonym
0 LMN1
1 LMNA
2 LGR3
3 TSHR
4 MAPT
As can be seen, the same component_id can be linked to various component_synonyms (essentially the same gene, but with different names). I want to find the frequency of each gene so I can identify the top 20 most frequently hit genes, and therefore I performed a value_counts on the column "component_id". I get something like this:
In [84]: df.component_id.value_counts()
Out[84]:
5432 804
3947 402
5147 312
3 304
2693 294
75 282
Name: component_id, dtype: int64
Is there a way for me to order the entire dataframe according to the component_id that is present the most number of times?
And also, is it possible for my dataframe to contain only the first occurrence of each component_id?
Any advice would be greatly appreciated!
I think you can make use of a count column to sort the rows and then drop it, i.e.:
df['count'] = df.groupby('component_id')['component_id'].transform('count')
df_sorted = df.sort_values(by='count', ascending=False).drop('count', axis=1)
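For the second part of the question, keeping only the first occurrence of each component_id, a drop_duplicates on the sorted frame should do it. This is a sketch on top of df_sorted from above, not part of the original answer:
# Keep only the first row for each component_id, preserving the frequency order.
df_first = df_sorted.drop_duplicates(subset='component_id', keep='first')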
I've created a new column in a DataFrame that contains the categorical feature 'QD', which describes in which "decile" (the lowest 10%, 20%, 30% of values, and so on) the value of another feature of the DataFrame falls. You can see the DF head below:
EPS CPI POC Vendeu Delta QD
1 20692 1 19185.30336 0 -1506.69664 QD07
8 20933 1 20433.27115 0 -499.72885 QD08
10 20393 1 20808.04948 0 415.04948 QD10
18 20503 1 19153.45978 0 -1349.54022 QD07
19 20587 1 20175.31906 1 -411.68094 QD09
The 'QD' column was created through the function below:
minimo = DF['EPS'].min()
passo = (DF['EPS'].max() - DF['EPS'].min()) / 10

def get_q(value):
    for i in range(1, 11):
        if value < (minimo + (i * passo)):
            return str('Q' + str(i).zfill(2))
(The function is applied to the 'Delta' column.)
Analyzing this column, I noticed something strange:
AUX2['QD'].unique()
out:
array(['QD07', 'QD08', 'QD10', 'QD09', 'QD06', 'QD05', 'QD04', 'QD03',
'QD02', 'QD01', None], dtype=object)
The .unique() method returns an array with a None value in it. At first I thought there was something wrong with the function, but look what happens when I try to grab the positions of the None value:
AUX2['QD'].value_counts()
out:
QD05 852
QD04 848
QD06 685
QD03 578
QD07 540
QD08 377
QD02 318
QD09 209
QD10 68
QD01 61
Name: QD, dtype: int64
len(AUX2[AUX2['QD'] == None]['QD'])
out:
0
What am I missing here?
When you are using .value_counts(), add dropna=False so the missing values show up in the counts. And to select the rows where QD is missing, filter with isnull() instead of comparing against None:
df[df['QD'].isnull()]
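The underlying reason is that element-wise comparison against None/NaN never matches in pandas, while isnull() is the supported test for missing values. A small sketch (the column name QD is from the question; the values are made up):
import pandas as pd

s = pd.Series(['QD05', 'QD04', None], name='QD')

print((s == None).sum())             # 0 -- comparing against None never matches
print(s.isnull().sum())              # 1 -- isnull() finds the missing entry
print(s.value_counts(dropna=False))  # the missing value now appears in the counts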
Working on a problem, I have the following dataframe in Python:
week hour week_hr store_code baskets
0 201616 106 201616106 505 0
1 201616 107 201616107 505 0
2 201616 108 201616108 505 0
3 201616 109 201616109 505 18
4 201616 110 201616110 505 0
5 201616 106 201616108 910 0
6 201616 107 201616106 910 0
7 201616 108 201616107 910 2
8 201616 109 201616108 910 3
9 201616 110 201616109 910 10
Here "hour" variable is a concat of "weekday" and "hour of shop", example weekday is monday=1 and hour of shop is 6am then hour variable = 106, similarly cal_hr is a concat of week and hour. I want to get those rows where i see a trend of no baskets , i.e 0 baskets for rolling 3 weeks. in the above case i will only get the first 3 rows. i.e. for store 505 there is a continuous cycle of 1 baskets from 106 to 108. But i do not want the rows (4,5,6) because even though there are 0 baskets for 3 continuous hours but the hours are actually NOT continuous. 110 -> 106 -> 107 . For the hours to be continuous they should lie in the range of 106 - 110.. Essentially i want all stores and the respective rows if it has 0 baskets for continuous 3 hours on any given day. Dummy output
week hour week_hr store_code baskets
0 201616 106 201616106 505 0
1 201616 107 201616107 505 0
2 201616 108 201616108 505 0
Can I do this in Python using pandas and loops? The dataset requires sorting by store and hour. I am completely new to Python.
Do the following:
Sort by store_code, week_hr
Filter by 0
Store the difference df['week_hr'][1:].values - df['week_hr'][:-1].values so you know whether they are continuous.
Now you can assign group labels to the continuous runs and filter as you want.
import numpy as np
import pandas as pd
# 1
t1 = df.sort_values(['store_code', 'week_hr'])
# 2
t2 = t1[t1['baskets'] == 0].copy()
# 3
continuous = t2['week_hr'][1:].values-t2['week_hr'][:-1].values == 1
groups = np.cumsum(np.hstack([False, continuous==False]))
t2['groups'] = groups
# 4
t3 = t2.groupby(['store_code', 'groups'], as_index=False)['week_hr'].count()
t4 = t3[t3.week_hr > 2]
print(pd.merge(t2, t4[['store_code', 'groups']]))
There's no need for looping!
You can solve it with these steps:
Sort by store_code, week_hr
Filter by 0
Group by store_code
Find continuous
Code:
t1 = df.sort_values(['store_code', 'week_hr'])
t2 = t1[t1['baskets'] == 0]
grouped = t2.groupby('store_code')['week_hr'].apply(lambda x: x.tolist())
for store_code, week_hrs in grouped.items():
    print(store_code, week_hrs)
    # do something
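One possible way to finish the "# do something" step, sketched here rather than taken from the original answer: continuing from t2 above, start a new run whenever the store changes or week_hr does not step up by exactly 1, then keep only runs of length 3 or more.
# A new run starts whenever the store changes or week_hr is not exactly 1 more than the previous row.
new_run = (t2['store_code'] != t2['store_code'].shift()) | (t2['week_hr'].diff() != 1)
runs = t2.assign(run_id=new_run.cumsum())

# Keep only runs with at least 3 consecutive zero-basket hours.
run_size = runs.groupby('run_id')['week_hr'].transform('size')
print(runs[run_size >= 3].drop(columns='run_id'))
On the sample data this keeps exactly the first three rows (store 505, hours 106 to 108), matching the dummy output.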