Data Analysis with Outliers - python

My code is supposed to return a statistical analysis of approximately 65 columns of data (questions from a survey). Sample data is given below, as well as the current code. Currently, the output only shows the columns that contain no strings (the others return NaN and don't even show up in the Excel output).
I believe the issue results from some of the data points being marked 'No Data' and some marked 'Outlier'.
I'd like to learn a way to ignore the outlier/no data points and display statistics such as mean or median for the rest of the data. I'd also love to learn how to incorporate conditional functions to display results such as 'count of responses > 4.25' so that I can expand on the analysis.
Q1           Q2           Q3           Q4           Q5           Q6
4.758064516  4.709677419  4.629032258  Outlier      4.708994709  4.209677419
4.613821138  No Data      4.259259259  4.585774059  4.255927476  Outlier
4.136170213  4.309322034  4.272727273  4.297169811  No Data      4.29468599
4.481558803  4.581476323  4.359495445  4.558252427  4.767926491  3.829030007
4.468085106  4.446808511  4.425531915  4.446808511  4.423404255  4.14893617
Sample Desired Output (doesn't correlate to the sample data):
Code:
import pandas as pd
from pandas import ExcelWriter
# Pull in Data
path = r"C:\Users\xx.xx\desktop\Python\PyTest\Pyxx.xlsx"
sheet = 'Adjusted Data'
data = pd.read_excel(path,sheet_name=sheet)
#Data Analysis
analysis = pd.DataFrame(data.agg(['count','min','mean', 'median', 'std']), columns=data.columns).transpose()
print(analysis)
g1 = data.groupby('INDUSTRY').median()
print(g1)
g2 = data.groupby('New Zone').median()
print(g2)
#Excel
path2 = r"C:\Users\xx.xx\desktop\Python\PyTest\Pyxx2.xlsx"
writer = ExcelWriter(path2)
g1.to_excel(writer,'x')
g2.to_excel(writer,'y')
analysis.to_excel(writer,'a')
data.to_excel(writer,'Adjusted Data')
writer.save()
EDIT
Count how many of the responses to Q1 are > X (in this case, K1 = COUNTIF(K1:K999,TRUE)).
I want the values found in K1 & M1 (and so on for all of the questions) to be added to the analysis table like below:

This happens precisely because of the strings. They cannot be aggregated together with floating-point numbers; it is an undefined operation, hence the NaN.
Try to clean up the data.
Options are:
Drop the rows that contain 'No Data' or 'Outlier', if this makes sense for your statistics (you can even do this one column at a time, computing statistics for one column at a time) - see the sketch below.
Substitute those values with the mean of that column (this is one of the standard procedures in statistics).
Think of a domain-specific way to treat this kind of data.
In any case, I would try to remove the strings from the data.
If you cannot do that, it probably means that this data doesn't belong together with the rest because it comes from a different distribution.
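For example, here is a minimal sketch of the first option combined with the conditional count from the EDIT. It assumes the question columns are named Q1, Q2, ... as in the sample and that data is the frame read from Excel above; adjust the names and the 4.25 threshold to your sheet.

import pandas as pd

# Treat 'No Data' / 'Outlier' (and any other string) as missing values
question_cols = [c for c in data.columns if c.startswith('Q')]
numeric = data[question_cols].apply(pd.to_numeric, errors='coerce')  # strings -> NaN

# count/min/mean/median/std skip NaN by default, so only real responses are used
analysis = numeric.agg(['count', 'min', 'mean', 'median', 'std']).transpose()

# Conditional statistics, e.g. how many responses per question exceed 4.25
analysis['count > 4.25'] = (numeric > 4.25).sum()

print(analysis)

Writing the cleaned columns back with data[question_cols] = numeric also makes the groupby medians in your original code work, since the medians then simply ignore the former string cells.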

Related

Reshaping a materials science dataset (probably using melt())

I'm dealing with a materials science dataset and I'm in the following situation,
I have data organized like this:
Chemical_Formula   Property_name        Property_Scalar
He                 Electrical conduc.   1
NO_2               Resistance           50
CuO3               Hardness
...                ...                  ...
CuO3               Fluorescence         300
He                 Toxicity             39
NO2                Hardness             80
...                ...                  ...
As you can see, it is really messy because the same chemical formula appears more than once throughout the dataset, each time referring to a different property. My question is: how can I easily split the dataset into smaller ones, matching every formula with its descriptors in order? (I used fictional names and values, just to explain my problem.)
I'm on Jupyter Notebook and I'm using Pandas.
I'm editing my question trying to be more clear:
My goal would be to plot some histograms of (for example) the number of materials vs. conductivity at different temperatures (100K, 200K, 300K). So I need both conductivity and temperature for each material to be clearly comparable. For example, I guess a more convenient format to obtain would be:
Chemical_formula   Conductivity   Temperature
He                 5              10K
NO_2               7              59K
CuO_3              10             300K
...                ...            ...
He                 14             100K
NO_2               5              70K
...                ...            ...
I think this issue is related to reshaping the dataset, but I also need each formula to match exactly its temperature and conductivity. Thank you for your help!
If you want to plot Conductivity versus Temperature for a given formula, you can simply select the rows that match this condition.
import pandas as pd
import matplotlib.pyplot as plt

# select only the rows for the formula of interest, sorted by temperature
formula = 'NO_2'
subset = df.loc[df['Chemical_Formula'] == formula].sort_values('Temperature')

x = subset['Temperature'].values
y = subset['Conductivity'].values
plt.plot(x, y)
plt.show()
Here, we define the formula you want to extract. Then we select only the rows in the DataFrame where the value in the column 'Chemical_Formula' matches your specified formula using df.loc[]. This returns a new DataFrame that is a subset of your original DataFrame, containing only the rows where our condition is satisfied. We sort this subset by 'Temperature' (I assume you want to plot Temperature on the x-axis) and store it as subset. We then select the 'Temperature' and 'Conductivity' columns, which return pandas.Series objects, and convert them to numpy arrays by calling .values. We store these in the x and y variables and pass them to the matplotlib plot function.
EDIT:
To get from the first DataFrame to the second DataFrame described in your post, you can use the pivot function (assuming your first DataFrame is named df):
df = df.pivot(index='Chemical_Formula', columns='Property_name', values='Property_Scalar')
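For illustration, here is a minimal, self-contained sketch of that pivot on made-up data (note that pivot requires each (formula, property) pair to appear only once):

import pandas as pd

# toy frame in the same long format as the question (values are made up)
long_df = pd.DataFrame({
    'Chemical_Formula': ['He', 'He', 'NO_2', 'NO_2'],
    'Property_name': ['Conductivity', 'Temperature', 'Conductivity', 'Temperature'],
    'Property_Scalar': [5, 10, 7, 59],
})

wide = long_df.pivot(index='Chemical_Formula', columns='Property_name',
                     values='Property_Scalar')
print(wide)
# Property_name     Conductivity  Temperature
# Chemical_Formula
# He                           5           10
# NO_2                         7           59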

How to create a new python DataFrame with multiple columns of differing row lengths?

I'm organizing a new dataframe in order to easily insert data into a Bokeh visualization code snippet. I think my problem is due to differing row lengths, but I am not sure.
Below, I organized the dataset in alphabetical order by country name and created an alphabetical list of the individual countries. new_data.tail() shows Zimbabwe listed last; there are 80336 rows in total, hence the sorting.
df_ind_data = pd.DataFrame(ind_data)
new_data = df_ind_data.sort_values(by=['country'])
new_data = new_data.reset_index(drop=True)
country_list = list(ind_data['country'])
new_country_set = sorted(set(country_list))
My goal is to create a new DataFrame with 76 columns (country names), with the specific 'trust' data in the rows underneath each country column.
df = pd.DataFrame()
for country in new_country_set:
    pink = new_data.loc[new_data['country'] == country]
    df[country] = pink.trust
In the resulting DataFrame, the data does not get included for the rest of the columns after the first. I believe this is because the number of rows of 'trust' data varies by country: while the first column has 1000 rows, some countries have as many as 2500 data points and some as few as 500.
I have attempted a few different methods to specify the number of rows in 'df', but to no avail.
The visualization code snippet I have utilizes this same exact data structure for the template data, which is why I'm attempting to put it in a dataframe. Plus, I can't do it, so I want to know how to do it.
Yes, I can put it in a dictionary, but I want to put it in a dataframe.
You should use combine_first when you add a new column so that the dataframe index gets extended. Instead of
df[country] = pink.trust
you should use
df = pink.trust.combine_first(df)
which ensures that your index is always the union of all added columns.
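As a toy illustration of that index-union behaviour (made-up values):

import pandas as pd

a = pd.Series([1.0, 2.0], index=[0, 1])
b = pd.Series([3.0, 4.0], index=[2, 3])

# a.combine_first(b) keeps a's values where a has them and fills the rest from b,
# so the result covers the union of both indexes: {0, 1, 2, 3}
print(a.combine_first(b))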
I think in this case pd.pivot(columns='var', values='val') will work for you, especially since you already have a dataframe. This function transfers the values of a particular column into column names. You can check the documentation for additional info. I hope that helps.
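A minimal sketch of that idea, assuming the sorted frame is new_data with 'country' and 'trust' columns as in the question; giving each row a position within its own country lets pivot line the values up instead of scattering them one per row:

import pandas as pd

# position of each row within its own country
new_data['row'] = new_data.groupby('country').cumcount()

# one column per country, one 'trust' value per row; shorter countries are padded with NaN
df = new_data.pivot(index='row', columns='country', values='trust')
print(df.head())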

Pandas converting data to NaN when adding to a new dataset

I've been trying to extract specific data from a given dataset and add it to a new one in a specific set of organized columns. I'm doing this by reading a CSV file and using the pandas string methods. The problem is that even though the data is extracted correctly, Pandas adds the second column as NaN even though there is data stored in the affected variable. Please see my code below; any idea how to fix this?
processor=pd.DataFrame()
Hospital_beds="SH.MED.BEDS.ZS"
Mask1=data["IndicatorCode"].str.contains(Hospital_beds)
stage=data[Mask1]
Hospital_Data=stage["Value"]
Birth_Rate="SP.DYN.CBRT.IN"
Mask=data["IndicatorCode"].str.contains(Birth_Rate)
stage=data[Mask]
Birth_Data=stage["Value"]
processor["Countries"]=stage["CountryCode"]
processor["Birth Rate per 1000 people"]=Birth_Data
processor["Hospital beds per 100 people"]=Hospital_Data
processor.head(10)
The problem here is that the indices are not matching up. When you initially populate the processor data frame you are using each line from the original dataframe that contained birth rate data. These lines are different from the ones that contain the hospital beds data so when you do
processor["Hospital beds per 100 people"] = Hospital_Data
pandas will create the new column, but since there are no matching indices for the Hospital_Data in processor it will just contain null values.
Probably what you first want to do is re-index the original data using the country code and the year
data.set_index(['CountryCode','Year'], inplace=True)
You can then create a view of just the indicators you are interested in
indicators = ['SH.MED.BEDS.ZS', 'SP.DYN.CBRT.IN']
dview = data[data.IndicatorCode.isin(indicators)]
Finally you can then pivot on the indicator code to view each indicator on the same line
dview.pivot(columns='IndicatorCode')['Value']
But note this will still contain a lot of NaNs. This is just because the hospital bed data is updated very infrequently (or e.g. in Aruba not at all). But you can filter these out as needed.
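Gathering those steps into one place, a minimal sketch (assuming data has the CountryCode, Year, IndicatorCode and Value columns used above):

import pandas as pd

indicators = ['SH.MED.BEDS.ZS', 'SP.DYN.CBRT.IN']

data.set_index(['CountryCode', 'Year'], inplace=True)
dview = data[data.IndicatorCode.isin(indicators)]

# one row per (country, year), one column per indicator, NaN where no value was reported
processor = dview.pivot(columns='IndicatorCode')['Value']
print(processor.head(10))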

Python interpolate throws no errors - but also does nothing

I am trying some DataFrame manipulation in Pandas that I have learnt. The dataset that I am playing with is from the EY Data Science Challenge.
This first part may be irrelevant but just for context - I have gone through and set some indexes:
import pandas as pd
import numpy as np

# loading the main dataset
df_main = pd.read_csv(filename)

'''Sorting Indexes'''
# getting rid of the id column
del df_main['id']

# sorting values by the LOCATION and TIME columns, then setting the index to
# LOCATION (1st tier) and TIME (2nd tier) and re-sorting
df_main = df_main.sort_values(['LOCATION', 'TIME'])
df_main = df_main.set_index(['LOCATION', 'TIME']).sort_index()
The problem I have is with the missing values - I have decided that columns 7 ~ 18 can be interpolated because a lot of the data is very consistent year by year.
So I made a simple function to take in a list of columns and apply the interpolate function for each column.
'''Missing Values'''
x = df_main.groupby("LOCATION")

def interpolate_columns(list_of_column_names):
    for column in list_of_column_names:
        df_main[column] = x[column].apply(lambda s: s.interpolate(method='linear'))

interpolate_columns(list(df_main.columns[7:18]))
However, the problem I am getting is that one of the columns (Access to electricity (% of urban population with access) [1.3_ACCESS.ELECTRICITY.URBAN]) does not seem to interpolate, while all the other columns are interpolated successfully.
I get no errors thrown when I run the function, and it is not trying to interpolate backwards either.
Any ideas regarding why this problem is occurring?
EDIT: I should also mention that the column in question was missing the same number of values - and in the same rows - as many of the other columns that interpolated successfully.
After looking at the data more closely, it seems like interpolate was not working on some columns because I was missing data at the first rows of the group in the groupby object.
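A minimal sketch of one way to handle that, reusing the grouped frame from above: linear interpolation cannot fill NaNs before a group's first observed value, so one simple choice (an assumption, not the only option) is to back-fill those leading gaps with the first observed value.

grouped = df_main.groupby('LOCATION')

def interpolate_columns(list_of_column_names):
    for column in list_of_column_names:
        # interpolate the interior gaps, then back-fill any leading NaNs
        df_main[column] = grouped[column].apply(
            lambda s: s.interpolate(method='linear').bfill()
        )

interpolate_columns(list(df_main.columns[7:18]))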

how to do a nested for-each loop with PySpark

Imagine a large dataset (>40GB parquet file) containing value observations of thousands of variables as triples (variable, timestamp, value).
Now think of a query in which you are just interested in a subset of 500 variables. And you want to retrieve the observations (values --> time series) for those variables for specific points in time (observation windows or timeframes), each having a start and end time.
Without distributed computing (Spark), you could code it like this:
for var_ in variables_of_interest:
    for incident in incidents:
        var_df = df_all.filter(
            (df.Variable == var_)
            & (df.Time > incident.startTime)
            & (df.Time < incident.endTime))
My question is: how to do that with Spark/PySpark? I was thinking of either:
joining the incidents somehow with the variables and filtering the dataframe afterward.
broadcasting the incident dataframe and using it within a map-function when filtering the variable observations (df_all).
using RDD.cartesian or RDD.mapPartitions somehow (remark: the parquet file was saved partitioned by variable).
The expected output should be:
incident1 --> dataframe 1
incident2 --> dataframe 2
...
Where dataframe 1 contains all variables and their observed values within the timeframe of incident 1 and dataframe 2 those values within the timeframe of incident 2.
I hope you got the idea.
UPDATE
I tried to code a solution based on idea #1 and the code from the answer given by zero323. It works quite well, but I wonder how to aggregate/group it by incident in the final step? I tried adding a sequential number to each incident, but then I got errors in the last step. It would be cool if you could review and/or complete the code. Therefore I uploaded sample data and the scripts. The environment is Spark 1.4 (PySpark):
Incidents: incidents.csv
Variable value observation data (77MB): parameters_sample.csv (put it to HDFS)
Jupyter Notebook: nested_for_loop_optimized.ipynb
Python Script: nested_for_loop_optimized.py
PDF export of Script: nested_for_loop_optimized.pdf
Generally speaking, only the first approach looks sensible to me. The exact joining strategy depends on the number of records and their distribution, but you can either create a top-level data frame:
ref = sc.parallelize([(var_, incident)
                      for var_ in variables_of_interest
                      for incident in incidents
]).toDF(["var_", "incident"])
and simply join
from pyspark.sql.functions import col

same_var = col("Variable") == col("var_")
same_time = col("Time").between(
    col("incident.startTime"),
    col("incident.endTime")
)

ref.join(df.alias("df"), same_var & same_time)
or perform joins against particular partitions:
incidents_ = sc.parallelize([
    (incident, ) for incident in incidents
]).toDF(["incident"])

for var_ in variables_of_interest:
    df = spark.read.parquet("/some/path/Variable={0}".format(var_))
    df.join(incidents_, same_time)
optionally marking one side as small enough to be broadcasted.
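For that, a minimal sketch using the broadcast hint (assuming a Spark version that provides pyspark.sql.functions.broadcast and that the incident table fits in executor memory):

from pyspark.sql.functions import broadcast

# hint that incidents_ is small, so Spark ships it to every executor
# instead of shuffling the large observation dataframe
for var_ in variables_of_interest:
    df = spark.read.parquet("/some/path/Variable={0}".format(var_))
    matched = df.join(broadcast(incidents_), same_time)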
