I am trying to run a Python script so that I can create a household count based on the residential address column and the residential city column. Both columns just contain strings.
The script that I have tried can be seen below:
dataset['id'] = dataset.groupby(['RESIDENTIAL_ADDRESS1','RESIDENTIAL_CITY']).ngroup()
dataset['household_count'] = dataset.groupby(['id'])['id'].transform('count')
Yet, it gives me this error after 20,000 rows:
DataSource.Error: ADO.NET: A problem occurred while processing your Python script. Here are the technical details: [DataFormat.Error] We couldn't convert to Number.
Details:
DataSourceKind=Python
DataSourcePath=Python
Message=A problem occurred while processing your Python script. Here are the technical details: [DataFormat.Error] We couldn't convert to Number.
ErrorCode=-2147467259
Is there any way to fix this? This code works in Python every single time, the error code makes absolutely no sense in Power BI, and I would greatly appreciate any advice on how to do this with DAX.
I have not been able to reproduce your error, but I strongly suspect the source of the error to be the datatypes. In the Power Query Editor, try transforming your grouping variables to text. The fact that your query fails for a dataset larger than 20000 rows should have absolutely nothing to do with the problem. Unless, of course, the data content somehow changes after row 20000.
If you could describe your data source and show the applied steps in the Power Query Editor, that would be of great help to anyone trying to assist you. You could also try to apply your code one step at a time, meaning making one table using dataset['id'] = dataset.groupby(['RESIDENTIAL_ADDRESS1','RESIDENTIAL_CITY']).ngroup() and yet another table using dataset['household_count'] = dataset.groupby(['id'])['id'].transform('count').
I might as well show you how to do just that, and maybe at the same time cement my suspicion that the error lies in the datatypes and hopefully rule out other error sources.
I'm using numpy along with a few random city and street names to build a dataset that I hope represents the structure and datatypes of your real world dataset:
Snippet 1:
import numpy as np
import pandas as pd
np.random.seed(123)
strt=['Broadway', 'Bowery', 'Houston Street', 'Canal Street', 'Madison', 'Maiden Lane']
city=['New York', 'Chicago', 'Baltimore', 'Victory Boulevard', 'Love Lane', 'Utopia Parkway']
RESIDENTIAL_CITY=np.random.choice(city,21000).tolist()
RESIDENTIAL_ADDRESS1=np.random.choice(strt,21000).tolist()
sample_dataset = pd.DataFrame({'RESIDENTIAL_CITY': RESIDENTIAL_CITY,
                               'RESIDENTIAL_ADDRESS1': RESIDENTIAL_ADDRESS1})
Copy that snippet, go to PowerBI Desktop > Power Query Editor > Transform > Run Python Script and run it to get this:
Then do the same thing with this snippet:
dataset['id'] =dataset.groupby(['RESIDENTIAL_ADDRESS1','RESIDENTIAL_CITY']).ngroup()
Now you should have this:
So far, your last step is called Changed Type 2. Right above it is a step called dataset. If you click that, you'll see that the datatype of id there is text (ABC) and that it changes to number (123) in the next step. With my settings, Power BI inserts the Changed Type 2 step automatically. Maybe that is not the case with you? It certainly can be a potential error source.
Next, insert your last line as a step of its own:
dataset['household_count'] = dataset.groupby(['id'])['id'].transform('count')
Now you should have the dataset like below, along with the same steps under Applied Steps:
With this setup, everything seems to be working fine. So, what do we know for sure by now?
The size of the dataset is not the problem
Your code itself is not the problem
Python should handle this perfectly in Power BI
And what do we suspect?
Your data is the problem - either missing values or wrong type
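If you want to attack both of those suspects directly inside the Run Python Script step, a rough sketch like the following might help (the column names are taken from your question; nothing else is assumed):
# Check for missing values in the grouping columns first
print(dataset[['RESIDENTIAL_ADDRESS1', 'RESIDENTIAL_CITY']].isna().sum())
# Force both grouping columns to text so a stray non-string value can't change the datatype
dataset['RESIDENTIAL_ADDRESS1'] = dataset['RESIDENTIAL_ADDRESS1'].astype(str)
dataset['RESIDENTIAL_CITY'] = dataset['RESIDENTIAL_CITY'].astype(str)
dataset['id'] = dataset.groupby(['RESIDENTIAL_ADDRESS1', 'RESIDENTIAL_CITY']).ngroup()
dataset['household_count'] = dataset.groupby('id')['id'].transform('count')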
I hope this helps you out somehow. If not, then don't hesitate to let me know.
Related
I need to get this result for an assignment using python/sqlite3.
[image: required result]
I did my query in MySQL, and I already got the answer to the assignment question. Since I am learning, I find it easier to do the queries in MySQL Workbench first.
[image: result in MySQL Workbench]
However, when I try to do it in a Jupyter notebook with sqlite3, it only shows zeros in the percentage column.
I am using the function pd.read_sql_query. I went to the documentation and could not find any argument there that would do what I want, or I did not understand it. I played with the coerce_float argument, but it did not make a difference. I am learning, so sometimes I do not understand the documentation completely.
query_results = pd.read_sql_query(query1,conn)
This is what I get in my Jupyter notebook:
[image: output in Jupyter Notebook]
I know the numbers are there because if I multiply the column 'percentage_female_only_movie' by 100, I see them. I would like to know how to show them like in MySQL Workbench.
Thank you for any help. And if you know of any link where I can learn about this type of issue, I would love it if you could share it.
Try df[colname] = df[colname].astype(float).
This converts the column to float, and you should then see the values.
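For example, applied to the result of your query (I'm taking the column name from the alias in your SELECT; adjust it if yours differs):
query_results = pd.read_sql_query(query1, conn)
# Convert the percentage column to float so the decimals are kept and displayed
query_results['Percentage_Female_only_Movie'] = query_results['Percentage_Female_only_Movie'].astype(float)
print(query_results.head())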
I found the solution. I needed to CAST the numerator and the denominator of the column I was generating in the SELECT statement of my query.
SELECT SUBSTR(TRIM(m.year),-4) AS Movie_year,
ROUND(CAST(fcm.Female_Cast_Only_Movies*100 AS FLOAT)/ CAST(tm.movies_total AS FLOAT),2) AS Percentage_Female_only_Movie,
tm.movies_total As Total_movies
FROM Movie AS m
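I believe the reason the CAST is needed is that SQLite performs integer division when both operands are integers, so the ratio was truncated before ROUND was ever applied. A tiny standalone example (made-up numbers, no tables needed) shows the difference:
import sqlite3

conn = sqlite3.connect(':memory:')
# Integer / integer is truncated by SQLite
print(conn.execute('SELECT 7 * 100 / 250').fetchone())                                # (2,)
# Casting to FLOAT keeps the decimals
print(conn.execute('SELECT CAST(7 * 100 AS FLOAT) / CAST(250 AS FLOAT)').fetchone())  # (2.8,)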
output:
[image: query output]
I am an elementary Python programmer and have been using a module called pybaseball to analyze sabermetrics data. While using this module, I came across a problem when trying to retrieve information from it. The module reads a CSV file from a baseball stats site and loads it into a DataFrame for ease of use, but the problem is that some of the information is not shown and is instead replaced with a "...". An example of this is shown:
from pybaseball import batting_stats_range
data = batting_stats_range('2017-05-01', '2017-05-08')
print(data.head())
I should be getting:
https://github.com/jldbc/pybaseball#batting-stats-hitting-stats-for-players-within-seasons-or-during-a-specified-time-period
But the information is cut off from 'TM' all the way to 'CS' and is replaced with a "..." in my output. Can someone explain to me why this happens and how I can prevent it?
As the docs state, head() is meant for "quickly testing if your object has the right type of data in it." So it is expected that some data may not show, because the display is collapsed.
If you need to analyze the data in more detail, you can access specific columns with other methods.
For example, using .iloc. You can read more about it here, but essentially you can "ask" for a slice of those columns and then apply a further slice to get only the first n rows.
Another example would be .loc, docs here. The main difference is that .loc uses labels (column names) to filter data instead of the numerical order of the columns. You can select a subset of specific columns and then get a sample of rows from that.
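For instance, something along these lines (the column names here are only examples; check your DataFrame for the actual labels):
from pybaseball import batting_stats_range

data = batting_stats_range('2017-05-01', '2017-05-08')

# Positional slicing with .iloc: first 5 rows of the first 10 columns
print(data.iloc[:5, :10])

# Label-based selection with .loc: pick specific columns by name, then take a few rows
print(data.loc[:, ['Name', 'Tm', 'HR', 'SB', 'CS']].head())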
So, to answer your question: the "..." is pandas' way of collapsing data in order to give a prettier view of the results.
I'm currently trying to learn more about deep learning/CNNs/Keras through what I thought would be a quite simple project of just training a CNN to detect a single specific sound. It's been a lot more of a headache than I expected.
I'm currently reading through this, ignoring the second section about GPU usage; the first part definitely seems like exactly what I need. But when I go to run the script (my script is pretty much lifted straight from the section in the link above that says "Putting the pieces together, you may end up with something like this:"), it gives me this error:
AttributeError: 'DataFrame' object has no attribute 'file_path'
I can't find anything in the pandas documentation about a DataFrame.file_path function. So I'm confused as to what that part of the code is attempting to do.
My CSV file contains two columns, one with the paths and then a second column denoting the file paths as either positive or negative.
Sidenote: I'm also aware that this entire guide just may not be the thing I'm looking for. I'm having a very hard time finding any material that is useful for the specific project I'm trying to do and if anyone has any links that would be better I'd be very appreciative.
The expression df.file_path means that you want to access the file_path column in your dataframe. It seems that your dataframe object does not contain this column. With df.head() you can check whether your dataframe object contains the needed fields.
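A quick way to check (the file name and the alternative column name below are made up; adjust them to your CSV):
import pandas as pd

df = pd.read_csv('labels.csv')  # your CSV of file paths and labels
print(df.columns)               # shows the column names the dataframe actually has

# df.file_path is just attribute-style access, equivalent to df['file_path'].
# If your header uses a different name, rename it to match what the guide expects:
df = df.rename(columns={'your_path_column': 'file_path'})
# Or, if the CSV has no header row at all, supply the names when reading it:
# df = pd.read_csv('labels.csv', header=None, names=['file_path', 'label'])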
I'm having trouble selecting specific values of a row with pandas.
I have a CSV file with confirmed cases of Coronavirus in each country each day. So obviously some countries started having cases in different days and progressed in different ways.
Dataframe of countries I'm trying to plot:
I would like to filter each row from the 50th confirmed case onward, which occurs on different days for each country.
I tried to use the command df[df['column']>50], but this works for a single column and I want to do it for all columns.
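To illustrate what I mean, here is a rough sketch with made-up countries and numbers, showing the kind of per-column filtering I'm after:
import pandas as pd

# Each column is a country, each row is a day of cumulative confirmed cases
df = pd.DataFrame({'Brazil': [0, 10, 60, 120, 300],
                   'Italy': [55, 80, 150, 400, 900]})

# For every country separately, keep only the values from the 50th case onward
# and renumber the rows so each country starts at its own "day 0"
since_50 = {country: s[s >= 50].reset_index(drop=True) for country, s in df.items()}
print(pd.DataFrame(since_50))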
All my life I have worked only with procedural programming in Python, without libraries, but this week I decided to start using some of them, so my understanding of the libraries is very limited and I don't know how to combine a for loop with a library function, which I think is what is needed here. This is also my first question on Stack Overflow, so if I am doing something wrong please tell me. Thank you!
I'm working through Wes McKinney's Python for Data Analysis. While I'm working in Python 3 and the book is written in Python 2, this is generally not an issue, and if anything a good exercise.
However, I've reached an impasse on Chapter 2, example 3: US Baby Names 1880-2010 (pg. 34). The purpose of the following code is to insert into the dataframe a column titled 'prop' that contains, for each year and sex, the fraction of babies given each name:
def add_prop(group):
    births = group.births.astype(float)
    group['prop'] = births / births.sum()
    return group

names = names.groupby(['year', 'sex']).apply(add_prop)
'names' is a dataframe with five columns ('name', 'sex', 'births', 'year', and this adds 'prop') and approximately 1.7 million rows. To test whether prop was added correctly, you then check that the proportions sum to approximately 1 with np.allclose(names.groupby(['year','sex']).prop.sum(), 1).
My problem is that the function runs unpredictably. Perhaps once out of every 15 or 20 runs np.allclose will be true, and the function will have been applied to the dataframe correctly. Otherwise np.allclose is false, and it is wrong in different ways each time. Later you use this dataframe to graph the proportion of births represented in the top 1000 names by sex, and the shape of that graph changes from run to run. I know the problem is in how the proportion is being calculated and added, because the rest of the dataframe doesn't vary.
What is introducing unpredictability into this example? While I suspect it's the .apply() command, I'm not sure and don't know how to test my hypothesis. It's been suggested to me that part of the code block is deprecated, but Jupyter Notebook doesn't come up with a warning and I haven't been able to find anything online. I've gone over my code twice, and overall it's virtually identical to the book's, and is identical in the case of this block. Thanks in advance.
I think the issue is in using a float data type for prop. Floats are bad where you need accuracy. That's why, in the book, he says the sums of the props should add to 1 or be "sufficiently close to (but perhaps not exactly equal to) 1".
I'm new to Python myself, so I don't know the best solution. In databases I avoid floats and use the decimal data type. Regardless, if we're computing the proportion for each of over a million records, it's going to be tough to maintain exact accuracy.
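To illustrate the point about floats, here is a tiny standalone example (not from the book) of why np.allclose is used instead of an exact comparison:
import numpy as np

# Ten tenths do not sum to exactly 1.0 in floating point
total = sum([0.1] * 10)
print(total)                  # 0.9999999999999999
print(total == 1.0)           # False
print(np.allclose(total, 1))  # True: equal within a small tolerance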