Reshaping a materials science dataset (probably using melt()) - python

I'm dealing with a materials science dataset and I'm in the following situation:
I have data organized like this:
Chemical_Formula   Property_name        Property_Scalar
He                 Electrical conduc.   1
NO_2               Resistance           50
CuO3               Hardness
...                ...                  ...
CuO3               Fluorescence         300
He                 Toxicity             39
NO2                Hardness             80
...                ...                  ...
As you can see, it is really messy, because the same chemical formula appears more than once throughout the dataset, each time referring to a different property. My question is: how can I easily split the dataset into smaller ones, matching every formula with its descriptors in ORDER? (I used fictional names and values just to explain my problem.)
I'm on Jupyter Notebook and I'm using Pandas.
I'm editing my question to try to be clearer:
My goal would be to plot some histograms of (for example) the number of materials vs. conductivity at different temperatures (100 K, 200 K, 300 K). So I need both the conductivity and the temperature for each material to be clearly comparable. For example, I guess a more convenient format to obtain would be:
Chemical_Formula   Conductivity   Temperature
He                 5              10K
NO_2               7              59K
CuO_3              10             300K
...                ...            ...
He                 14             100K
NO_2               5              70K
...                ...            ...
I think this issue is related to reshaping the dataset, but each formula should also MATCH exactly its temperature and conductivity. Thank you for your help!

If you want to plot Conductivity versus Temperature for a given formula, you can simply select the rows that match this condition.
import pandas as pd
import matplotlib.pyplot as plt

# select the rows for one formula and sort them by temperature
formula = 'NO_2'
subset = df.loc[df['Chemical_Formula'] == formula].sort_values('Temperature')

# pull the two columns out as numpy arrays and plot them
x = subset['Temperature'].values
y = subset['Conductivity'].values
plt.plot(x, y)
plt.show()
Here, we define the formula you want to extract. Then we select only the rows in the DataFrame where the value in the 'Chemical_Formula' column matches your specified formula, using df.loc[]. This returns a new DataFrame that is a subset of your original DataFrame, containing only the rows where our condition is satisfied. We sort this subset by 'Temperature' (I assume you want Temperature on the x-axis) and store it as subset. We then select the 'Temperature' and 'Conductivity' columns, which return pandas.Series objects, and convert them to numpy arrays by calling .values. We store these in the x and y variables and pass them to the matplotlib plot function.
EDIT:
To get from the first DataFrame to the second DataFrame described in your post, you can use the pivot function (assuming your first DataFrame is named df):
df = df.pivot(index='Chemical_Formula', columns='Property_name', values='Property_Scalar')
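For illustration, a minimal sketch of what pivot does on a tiny toy frame (the values are invented; note that pivot raises an error if a formula/property pair appears more than once, in which case pivot_table with an aggregation function is the usual fallback):
import pandas as pd

# toy frame in the original long format (invented values)
long_df = pd.DataFrame({
    'Chemical_Formula': ['He', 'He', 'NO_2', 'NO_2'],
    'Property_name':    ['Conductivity', 'Temperature', 'Conductivity', 'Temperature'],
    'Property_Scalar':  [5, 10, 7, 59],
})

# one row per formula, one column per property
wide = long_df.pivot(index='Chemical_Formula',
                     columns='Property_name',
                     values='Property_Scalar')
print(wide)
# Property_name     Conductivity  Temperature
# Chemical_Formula
# He                           5           10
# NO_2                         7           59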

Related

Data Analysis with Outliers

My code is supposed to return a statistical analysis of approximately 65 columns of data (questions from a survey). Sample data is given below, as well as the current code. Currently, the output only shows the columns that contain no strings (the others return NaN and don't even show up in the Excel file).
I believe the issue results from some of the data points being marked with 'No Data' and some marked with 'Outlier'.
I'd like to learn a way to ignore the outlier/no data points and display statistics such as mean or median for the rest of the data. I'd also love to learn how to incorporate conditional functions to display results such as 'count of responses > 4.25' so that I can expand on the analysis.
Q1 Q2 Q3 Q4 Q5 Q6
4.758064516 4.709677419 4.629032258 Outlier 4.708994709 4.209677419
4.613821138 No Data 4.259259259 4.585774059 4.255927476 Outlier
4.136170213 4.309322034 4.272727273 4.297169811 No Data 4.29468599
4.481558803 4.581476323 4.359495445 4.558252427 4.767926491 3.829030007
4.468085106 4.446808511 4.425531915 4.446808511 4.423404255 4.14893617
Sample Desired Output (doesn't correlate to the sample data):
Code:
import pandas as pd
from pandas import ExcelWriter

# Pull in data
path = r"C:\Users\xx.xx\desktop\Python\PyTest\Pyxx.xlsx"
sheet = 'Adjusted Data'
data = pd.read_excel(path, sheet_name=sheet)

# Data analysis: one row per question, one column per statistic
analysis = pd.DataFrame(data.agg(['count', 'min', 'mean', 'median', 'std']), columns=data.columns).transpose()
print(analysis)

# Group medians by industry and zone
g1 = data.groupby('INDUSTRY').median()
print(g1)
g2 = data.groupby('New Zone').median()
print(g2)

# Excel output
path2 = r"C:\Users\xx.xx\desktop\Python\PyTest\Pyxx2.xlsx"
writer = ExcelWriter(path2)
g1.to_excel(writer, 'x')
g2.to_excel(writer, 'y')
analysis.to_excel(writer, 'a')
data.to_excel(writer, 'Adjusted Data')
writer.save()
EDIT
Count how many of the responses to Q1 are > X (in this case, K1 = COUNTIF(K1:K999,TRUE))
I want the values found in K1 & M1 (and so on for all of the questions) to be added to the analysis table like below:
This happens exactly because of the strings: they cannot be summed with floating-point numbers. It is an undefined operation, hence the NaN.
Try to clean up the data.
Options are:
Drop the rows that contain 'No Data' or 'Outlier', if that makes sense for your statistics (you can even do this one column at a time, computing statistics for one column at a time).
Substitute those values with the mean of that column (this is one of the standard procedures in statistics).
Think of a domain-specific way to treat this kind of data.
In any case, I would try to remove the strings from the data, as sketched below.
If you cannot do that, it probably means that this data doesn't belong together with the rest because it comes from a different distribution.
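As a minimal sketch of that cleanup (assuming the question's setup; pd.to_numeric with errors='coerce' turns 'No Data' and 'Outlier' into NaN, which pandas aggregations then skip):
import pandas as pd

# coerce every cell to a number; 'No Data' and 'Outlier' become NaN
# (assumes grouping columns like 'INDUSTRY' are handled separately)
numeric = data.apply(pd.to_numeric, errors='coerce')

# NaNs are skipped by default, so these now work for every question column
analysis = numeric.agg(['count', 'min', 'mean', 'median', 'std']).transpose()

# COUNTIF-style extras, e.g. how many responses per question exceed 4.25
analysis['count > 4.25'] = (numeric > 4.25).sum()
print(analysis)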

How to work with aggregated data in pandas?

I have a dataset which looks like this:
val
1
1
3
4
6
6
9
...
I can't load it into a pandas dataframe due to its huge size. So I aggregate the data using Spark to form:
val occurrences
1 2
3 1
4 1
6 2
9 1
...
and load it into a pandas dataframe. The "val" column never goes above 100, so it doesn't take much memory.
My problem is that I can't operate easily on such a structure, e.g. find the mean or median using pandas, nor plot a boxplot with seaborn. I can only do it with explicit formulas I write myself, not with the ready-made built-in methods. Is there a pandas structure, or any other way, to cope with such data?
For example:
1,1,3,4,6,6,9
would be:
df = pd.DataFrame({'val': [1,3,4,6,9], "occurrences" : [2,1,1,2,1]})
The median is 4. I'm looking for a method to extract the median directly from the given df.
No, pandas does not operate on such objects the way you would expect. Elsewhere on Stack Overflow, even computing a median for that table structure takes at least a few lines of code.
If you wanted to make your own seaborn hooks/wrappers, a good place to start would probably be an efficient percentiles(df, p) method. The median is then just percentiles(df, [50]). A box plot would just be percentiles(df, [0, 25, 50, 75, 100]), and so on. Your development time could then be fairly minimal (depending on how complicated the statistics you need are).
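For instance, a minimal sketch of such a percentiles helper, assuming the (val, occurrences) layout from the question (the function name and the 'lower' interpolation style are my own choices):
import numpy as np
import pandas as pd

def percentiles(df, p):
    # walk the cumulative occurrence counts instead of expanding the data,
    # so memory stays proportional to the number of distinct values
    df = df.sort_values('val')
    cum = df['occurrences'].cumsum()
    total = cum.iloc[-1]
    result = []
    for q in p:
        # 1-based rank of the element holding the q-th percentile
        rank = max(1, int(np.ceil(q / 100.0 * total)))
        result.append(df['val'][cum >= rank].iloc[0])
    return result

df = pd.DataFrame({'val': [1, 3, 4, 6, 9], 'occurrences': [2, 1, 1, 2, 1]})
print(percentiles(df, [50]))                  # [4], the median
print(percentiles(df, [0, 25, 50, 75, 100]))  # box-plot statistics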

Python - NaN values in df.corr

I am finishing a project and I am trying to check the correlation between some pieces of information.
Basically, I have data on the survivors of an incident and I want to know the correlation between other variables and their survivability.
So I have the main df with all the information, then:
# creating a df listing who did not survive (0) and another listing who survived (1)
df_s0 = df.query("Survived == 0")
df_s1 = df.query("Survived == 1")
df_s0.corr()
Based on the correlation formula:
cor(a, b) = cov(a, b) / (stdev(a) * stdev(b))
If either a or b is constant (zero variance), then the correlation between the two is not defined (the division by zero produces NaNs).
In your example, the Survived column of df_s0 is constant (all zeros) and hence correlation is undefined for this column with other columns.
If you want to figure out the relationship between a discrete variable (Survived) and the rest of your features, you can look at box plots of your features across the two groups of Survived (0 and 1), to compare statistics like the mean and IQR. If you want to go a step further, you can use ANOVA to characterize the importance of your features based on their variance within and across the groups!
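For example, a minimal sketch of such a box-plot comparison with seaborn ('Age' is just a placeholder; substitute any numeric feature from your df):
import seaborn as sns
import matplotlib.pyplot as plt

# one box per Survived group, so the two distributions can be compared directly
sns.boxplot(x='Survived', y='Age', data=df)
plt.show()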

How to manipulate a matrix of data in python?

I'd like to create code that can read a matrix of data containing information about movies and create a histogram from it. The data set (matrix) contains several columns, and I'm interested in the column with movie release years and another column that says whether or not they pass the Bechdel test (the data set uses "Pass" and "Fail" to indicate whether a movie passed or failed the test). Knowing the column numbers of these two columns (release year and pass/fail), how can I create a histogram of the movies that fail the test, with the x-axis containing bins of movie years? The bin sizes are not too important; whatever pyplot defaults to would be fine.
What I can do (which is not a lot) is this:
plt.hist(year_by_Test_binary[:,0])
which creates a pretty but meaningless histogram of how many movies were released in bins of years (the matrix has years in the 0th column).
If you couldn't already tell, I am python-illiterate and struggling. Any help would be appreciated.
Assuming n is the column of the Bechdel test, and that your data is a numpy array:
plt.hist([matrix[matrix[:, n] == 'Pass', 0], matrix[matrix[:, n] == 'Fail', 0]])
We're giving matplotlib two vectors of years, one for movies that pass and one for movies that fail. It will then draw a histogram for each category, so you can visually compare the two.
To convert your data to a matrix, use:
numpy.asarray(data)
and to present it in a histogram you can use:
plt.plot(data)
or
plt.hist(data, bins)
where bins sets the number of bins for your data.
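Putting the two answers together, a minimal sketch under the question's assumptions (years in column 0, Pass/Fail in column n, data held in a NumPy object array; the toy rows are invented):
import numpy as np
import matplotlib.pyplot as plt

# toy matrix: column 0 = release year, column 1 = Bechdel result
matrix = np.array([
    [1994, 'Pass'],
    [1999, 'Fail'],
    [2003, 'Fail'],
    [2010, 'Pass'],
], dtype=object)

n = 1  # column holding the Pass/Fail flag
fail_years = matrix[matrix[:, n] == 'Fail', 0].astype(int)

plt.hist(fail_years)  # pyplot's default bin count, as the question allows
plt.xlabel('Release year')
plt.ylabel('Number of failing movies')
plt.show()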

How to apply a low-pass filter of 5Hz to a pandas dataframe?

I have a pandas.DataFrame indexed by time, as seen below. The other column contains data recorded from a device measuring current. I want to filter the second column with a low-pass filter with a cutoff frequency of 5 Hz to eliminate high-frequency noise. I want to return a dataframe, but I do not mind if it changes type for the application of the filter (numpy array, etc.).
In [18]: print df.head()
Time
1.48104E+12 1.1185
1.48104E+12 0.8168
1.48104E+12 0.8168
1.48104E+12 0.8168
1.48104E+12 0.8168
I am graphing this data by df.plot(legend=True, use_index=False, color='red') but would like to graph the filtered data instead.
I am using pandas 0.18.1 but I can change.
I have visited https://oceanpython.org/2013/03/11/signal-filtering-butterworth-filter/ and many other sources of similar approaches.
Perhaps I am over-simplifying this, but you can create a simple condition, build a new dataframe with the filter applied, and then create your graph from the new dataframe. Basically you just reduce the dataframe to only the records that meet the condition. I admit I do not know the exact threshold for high-frequency noise, but let's assume your second column is named "Frequency":
condition = df["Frequency"] < 1.0
low_pass_df = df[condition]
low_pass_df.plot(legend=True, use_index=False, color='red')
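That said, if you want an actual 5 Hz low-pass filter in the signal-processing sense (the Butterworth approach from the page the question links to), here is a sketch assuming a known sampling rate fs and a column named 'Current' (both are assumptions, since neither is given in the question):
import pandas as pd
from scipy.signal import butter, filtfilt

fs = 100.0    # assumed sampling rate of the measurements, in Hz
cutoff = 5.0  # low-pass cutoff, in Hz

# 4th-order Butterworth low-pass; Wn is the cutoff as a fraction of
# the Nyquist frequency (fs / 2)
b, a = butter(N=4, Wn=cutoff / (fs / 2), btype='low')

# filtfilt runs the filter forward and backward, so there is no phase lag
df['Filtered'] = filtfilt(b, a, df['Current'].values)
df['Filtered'].plot(legend=True, use_index=False, color='red')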
