I am not a statistician or anything like that. I am working on a project where I was given an Excel file, and I need to replicate the calculations made in that file in an HTML table.
I got most of the file right, but I am stuck on a function called FDIST, which, as far as I understand, computes the F probability distribution. I tried to look for something that does the same thing in Python (because I am using Django on the server side) and came across the scipy library, which helped a lot with the other calculations I needed, but I still can't find something that does what FDIST does in Excel. I found a function f.pdf, but it turns out it is not the same.
Can someone suggest a way to get the same result?
Thanks.
You can read this to learn more about the F distribution in general.
If you use the parameters x = 2.510, dfn = 3, dfd = 48 in Excel, F.DIST(x, dfn, dfd, TRUE) returns about 0.9302 and FDIST(x, dfn, dfd) returns about 0.0698.
Note that FDIST is kept only for compatibility with Excel 2007 and earlier; it was replaced by F.DIST.RT, which gives the same right-tail probability (F.DIST with Cumulative = TRUE gives the left-tail CDF instead).
Using scipy.stats you get the same results:
>>> from scipy.stats import f
>>> x = 2.510
>>> dfn = 3
>>> dfd = 48
>>> f.cdf(x, dfn, dfd)
0.930177201089
>>> 1 - f.cdf(x, dfn, dfd)
0.0698227989112
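As a side note, scipy also exposes the right-tail probability directly via the survival function, so you can skip the subtraction; f.sf should give the same value as 1 - f.cdf above:
>>> f.sf(x, dfn, dfd)  # right-tail probability, equivalent to Excel's FDIST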
Hope this helps.
I am working on a primitive version of a script that performs factor analysis and computes some parameters for item response theory. I need to make this code run on AWS because I have been asked to. However, I have absolutely zero experience with cloud computing, AWS, or anything related to that (I am just somewhat OK with writing Python and MATLAB scripts).
Can anyone please suggest the easiest way to make the following Python code work on AWS, in an EASY-TO-IMPLEMENT way that is doable for a total noob (including any changes I need to make inside the Python code)?
P.S.: I am expecting this script to give me the "estimates" and "ev" parameters. Converting this script to a function also did not work for me, but that is probably a separate issue, so I can convert it to a function with the desired returns as well (see the sketch after the code below).
import os
import pandas as pd
import numpy as np
from factor_analyzer import FactorAnalyzer
import matplotlib.pyplot as plt
os.chdir('C:/Users/ege/Desktop/Neurolize') #change to where your excel file is.
#access the relevant sheet with answers of each participant
df = pd.read_excel(open('irtTestCase.xlsx', 'rb'),
                   sheet_name='Survey Module 1', skiprows=[0])
#drop irrelevant columns (User ID & Response Time)
df.drop(['UserId', 'Unnamed: 15'],axis=1,inplace=True)
#drop the participants that did not answer all of the questions
df.dropna(inplace=True)
#replace the answers with numeric values
df = df.replace(regex={'None of the time': 1.0, 'Rarely': 2.0, 'Some of the time': 3.0,
                       'Often': 4.0, 'All of the time': 5.0})
#see if factor analysis can be performed on the data (p <0.05 = eligibility)
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
chi_square_value,p_value=calculate_bartlett_sphericity(df)
chi_square_value, p_value
#perform factor analysis - get eigenvectors
fa = FactorAnalyzer(rotation = None,n_factors=df.shape[1])
fa.fit(df)
ev,_ = fa.get_eigenvalues()
#get the ratio of variance explained by the addition of each factor
variances=fa.get_factor_variance()
cum_variance = variances[2] # %variance explained via the addition of each factor
#plot the relative component amplitudes (eigenvalues)
plt.scatter(range(1,df.shape[1]+1),ev)
plt.plot(range(1,df.shape[1]+1),ev)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigen Value')
plt.grid()
#get how much each question contributes to each of the factors
factorLoadings = fa.loadings_
'''
The traditional criterion is to treat each factor with an eigenvalue > 1 as
significant. We now have 3 factors with eigenvalues > 1, so it may be a good
idea to exclude the questions that mainly load onto the second and third
factors.
'''
# trimmed_df = df.drop(["I've been feeling optimistic about the future","I've been feeling interested in other people"], axis=1)
# #perform factor analysis - get eigenvectors
# fa = FactorAnalyzer(rotation = None,n_factors=trimmed_df.shape[1])
# fa.fit(trimmed_df)
# ev,_ = fa.get_eigenvalues()
# factorLoadings = fa.loadings_
#item response theory
from girth import pcm_mml
df_transpose = df.T
df_array = df_transpose.values
df_int = df_array.astype(int)
estimates = pcm_mml(df_int)
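For reference, here is a minimal sketch of what I mean by converting the script into a function with the desired returns (same preparation steps as above; the function name is my own, and the file path and sheet name are just the ones from my example):
def run_analysis(xlsx_path='irtTestCase.xlsx', sheet_name='Survey Module 1'):
    import pandas as pd
    from factor_analyzer import FactorAnalyzer
    from girth import pcm_mml
    # load and clean the survey answers exactly as in the script above
    df = pd.read_excel(xlsx_path, sheet_name=sheet_name, skiprows=[0])
    df.drop(['UserId', 'Unnamed: 15'], axis=1, inplace=True)
    df.dropna(inplace=True)
    df = df.replace(regex={'None of the time': 1.0, 'Rarely': 2.0, 'Some of the time': 3.0,
                           'Often': 4.0, 'All of the time': 5.0})
    # factor analysis for the eigenvalues
    fa = FactorAnalyzer(rotation=None, n_factors=df.shape[1])
    fa.fit(df)
    ev, _ = fa.get_eigenvalues()
    # item response theory estimates
    estimates = pcm_mml(df.T.values.astype(int))
    return estimates, ev

estimates, ev = run_analysis()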
I have tried forming .zip files, making an instance in EC2, and building a Docker image... but I failed in all attempts, and I am honestly just copying YouTube videos on the topic, which is really frustrating when you fail at copy-pasting solutions. I think I have some fundamental issues with my Python code in terms of compatibility with AWS.
With the method of adding a .zip file to AWS Lambda, I previously got this error even though my libraries are actually compatible with the Python version I am using in AWS (3.9).
I thought maybe the issue is that my local Python environment is 3.8, but I am not sure about that either.
I'm trying to get lines from a text file (.log) into a .txt document.
I need to get the same data into my .txt file, but the line itself is sometimes different. From what I have seen on the internet, this is usually done with a pattern that anticipates how the line is built.
1525:22Player 11 spawned with userinfo: \team\b\forcepowers\0-5-030310001013001131\ip\46.98.134.211:24806\rate\25000\snaps\40\cg_predictItems\1\char_color_blue\34\char_color_green\34\char_color_red\34\color1\65507\color2\14942463\color3\2949375\color4\2949375\handicap\100\jp\0\model\desann/default\name\Faybell\pbindicator\1\saber1\saber_malgus_broken\saber2\none\sex\male\ja_guid\420D990471FC7EB6B3EEA94045F739B7\teamoverlay\1
The line I'm working with usually looks like this. The data I'm trying to collect are:
\ip\0.0.0.0
\name\NickName_of_the_player
\ja_guid\420D990471FC7EB6B3EEA94045F739B7
And print these data inside a .txt file. Here is my current code.
As explained above, I'm unsure what keyword to use for my research on Google, or what this could even be called (since the string isn't always the same).
I have been looking around a lot, and most of the tests I have done have allowed me to do some things, but I'm not yet able to do what is explained above. So I'm hoping for guidance here :) (Sorry if I'm noobish; I understand a lot about how it works, I just didn't learn the language in school. I mostly do small scripts, and usually they work fine; this time it's way harder.)
def readLog(filename):
    with open(filename, 'r') as eventLog:
        data = eventLog.read()
        dataList = data.splitlines()
        return dataList

eventLog = readLog('games.log')
A note first: backslashes only act as escape characters in Python source-code literals, not in data read from a file, so reading the log with a normal open(filename, 'r') is fine. In the example below I use a raw string (r"...") only because the sample line is pasted directly into the code. To use your example, I ran
text_input = r"1525:22Player 11 spawned with userinfo: \team\b\forcepowers\0-5-030310001013001131\ip\46.98.134.211:24806\rate\25000\snaps\40\cg_predictItems\1\char_color_blue\34\char_color_green\34\char_color_red\34\color1\65507\color2\14942463\color3\2949375\color4\2949375\handicap\100\jp\0\model\desann/default\name\Faybell\pbindicator\1\saber1\saber_malgus_broken\saber2\none\sex\male\ja_guid\420D990471FC7EB6B3EEA94045F739B7\teamoverlay\1"
text_as_array = text_input.split('\\')
You'll need to know which columns contain the strings you care about. For example:
with open('output.dat', 'w') as fil:
    fil.write(text_as_array[6])
You can figure out these array positions from the sample string:
>>> text_as_array[6]
'46.98.134.211:24806'
>>> text_as_array[34]
'Faybell'
>>> text_as_array[44]
'420D990471FC7EB6B3EEA94045F739B7'
If the column positions are not consistent but the key-value pairs are always adjacent, we can leverage that
>>> text_as_array.index("ip")
5
>>> text_as_array[text_as_array.index("ip")+1]
'46.98.134.211:24806'
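Putting the pieces together, here is a minimal sketch (my assumptions: every relevant line contains "spawned with userinfo:" and uses the same key\value layout as the sample, and the output file name is just a placeholder) that parses each line into a dict and writes only the ip, name and ja_guid fields to a .txt file:
def parse_userinfo(line):
    # everything after the first backslash is a flat key\value\key\value... sequence
    parts = line.split('\\')[1:]
    return dict(zip(parts[0::2], parts[1::2]))

with open('games.log', 'r') as log, open('output.txt', 'w') as out:
    for line in log:
        if 'spawned with userinfo:' not in line:
            continue
        info = parse_userinfo(line.strip())
        out.write('\\ip\\%s\n' % info.get('ip', ''))
        out.write('\\name\\%s\n' % info.get('name', ''))
        out.write('\\ja_guid\\%s\n' % info.get('ja_guid', ''))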
I'm looking for a recipe for converting Pandas DataFrames to RDF data in Python. I'm aware of the following Python modules (I know how to Google!), but they do not work for me:
rdfpandas
pandasrdf
Neither seems mature. I have problems with both. In the case of rdfpandas, I'm unable to install it, and there are no examples and insufficient documentation. In the case of pandasrdf, the example doesn't work and crashes. I can fix it, but the resulting RDF file has zero triples, so the result is useless. I'd rather not have to write the data out to some intermediate file that I have to ingest later. Pandas -> numpy -> RDF would be OK, I guess. Does anybody have a working example of converting a Pandas DataFrame to RDF in one of the common serialisation formats that does not involve an artisanal black-magic package installation?
A newer version of RdfPandas is out, so you can try it and see if it covers your use case: https://rdfpandas.readthedocs.io/en/latest (thanks to Carmoreno for the prompt to fix the link).
An example based on https://github.com/cadmiumkitty/capability-models/blob/master/notebooks/investment_management_capabilities.csv is below:
import pandas as pd
import rdfpandas
df = pd.read_csv('investment_management_capabilities.csv', index_col='#id', keep_default_na=True)
g = rdfpandas.to_graph(df)
ttl = g.serialize(format='turtle')
with open('investment_management_capabilities.ttl', 'wb') as file:
    file.write(ttl)
The code that does the conversion is pretty minimal (just look at the to_graph method): https://github.com/cadmiumkitty/rdfpandas/blob/master/rdfpandas/graph.py, so you can use it directly as inspiration for your own conversion logic.
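If you do end up rolling your own, here is a minimal sketch with plain rdflib (the base URI, the use of the index as subject, and the column names as predicates are my own illustrative assumptions, not part of rdfpandas):
import pandas as pd
import rdflib

def df_to_graph(df, base='http://example.org/'):
    g = rdflib.Graph()
    for idx, row in df.iterrows():
        subject = rdflib.URIRef(base + str(idx))
        for col, value in row.items():
            if pd.isna(value):
                continue  # skip empty cells rather than emitting empty literals
            g.add((subject, rdflib.URIRef(base + str(col)), rdflib.Literal(value)))
    return g

g = df_to_graph(pd.DataFrame({'label': ['a', 'b']}, index=['item1', 'item2']))
print(g.serialize(format='turtle'))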
I tried really hard to find out how to do these simple lines of VBA code in Python via win32com, but I couldn't find how to execute them properly:
ActiveSheet.PivotTables("PivotTable1").PivotFields("Quarters").ClearAllFilters
ActiveSheet.PivotTables("PivotTable1").PivotFields("Effective deadline"). _
PivotFilters.Add2 Type:=xlBefore, Value1:="10/10/2017"
When running these lines :
from win32com.client import DispatchEx
excel = DispatchEx('Excel.Application')
wb = excel.Workbooks.Open('myfile.xlsx')
ws = wb.Worksheets('MySheet')
ws.PivotTables(1).PivotFields("Quarters").PivotFilters('Add2', 'xlBefore', '10/10/2017')
I end up with an 'Invalid number of parameters' error, so I guess I'm quite close, but I can't find the documentation to complete my code.
Has anyone ever managed to do this kind of work?
You are calling the wrong method. You should call .Add2 after the PivotFilters property:
ws.PivotTables(1).PivotFields("Effective deadline").ClearAllFilters()
ws.PivotTables(1).PivotFields("Effective deadline").PivotFilters.Add2(31, None, '10/10/2017')
Also, notice that you need to specify the XlPivotFilterType enumeration value according to the type of filter you want to apply (in this case xlBefore = 31).
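If you prefer not to hard-code 31, here is a small sketch using early binding so the named constant is available (the file path is just a placeholder; the sheet and field names are the ones from the question):
from win32com.client import gencache, constants

excel = gencache.EnsureDispatch('Excel.Application')  # early binding generates the type library, exposing named constants
wb = excel.Workbooks.Open(r'C:\path\to\myfile.xlsx')
ws = wb.Worksheets('MySheet')

field = ws.PivotTables(1).PivotFields("Effective deadline")
field.ClearAllFilters()
field.PivotFilters.Add2(constants.xlBefore, None, '10/10/2017')  # xlBefore == 31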
I'm wondering if anyone knows a Python package that allows you to save numpy arrays/recarrays in the .dta format of the statistical data analysis software Stata. This would really speed up a few steps in a system I have.
The scikits.statsmodels package includes a reader for Stata data files, which relies in part on PyDTA, as pointed out by @Sven. In particular, genfromdta() will return an ndarray, e.g.
from Python 2.7/statsmodels 0.3.1:
>>> import scikits.statsmodels.api as sm
>>> arr = sm.iolib.genfromdta('/Applications/Stata12/auto.dta')
>>> type(arr)
<type 'numpy.ndarray'>
The savetxt() function can be used in turn to save an array as a text file, which can be imported in Stata. For example, we can export the above as
>>> sm.iolib.savetxt('auto.txt', arr, fmt='%2s', delimiter=",")
and read it in Stata without a dictionary file as follows:
. insheet using auto.txt, clear
I believe a *.dta writer should be added in the near future.
The only Python library for Stata interoperability I could find merely provides read-only access to .dta files. The R foreign library, however, provides a function write.dta, and RPy provides a Python interface to R. Maybe the combination of these tools can help you.
pandas DataFrame objects now have a "to_stata" method, so you can do, for instance:
import pandas as pd
df = pd.read_stata('my_data_in.dta')
df.to_stata('my_data_out.dta')
DISCLAIMER: the first step is quite slow (in my test, around 1 minute for reading a 51 MB .dta - also see this question), and the second step produces a file which can be much larger than the original one (in my test, the size goes from 51 MB to 111 MB). This answer may look less elegant, but it is probably more efficient.
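Since the question mentions numpy recarrays specifically, a minimal sketch of that route (the field names here are purely illustrative) is to go recarray -> DataFrame -> .dta via pandas:
import numpy as np
import pandas as pd

# illustrative recarray with two named fields
arr = np.rec.fromarrays(
    [np.arange(5), np.linspace(0.0, 1.0, 5)],
    names='id,score')

# DataFrame.from_records understands recarrays, and to_stata writes the .dta file
pd.DataFrame.from_records(arr).to_stata('my_data_out.dta', write_index=False)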