How to remove the NameError: name 'Dataset' is not defined - python

I just can't find what I am doing wrong in defining df1.
import pandas as pd
df = pd.read_csv(r"D:\Programming\Datasets\avocado.csv")
df1 = df[ df['region'] == 'Albany' ]
df1
NameError Traceback (most recent call last)
NameError: name 'df1' is not defined

Use a double equals sign (==) while filtering:
import pandas as pd
df = pd.read_csv(r"D:\Programming\Datasets\avocado.csv")
df1 = df[df['region'] == 'Albany']
df1
I hope this helps.
Kind regards.

Please try the code below; I wonder whether you can get the filtered data:
filtered_region = df['region'] == 'Albany'
Check whether the filtered_region object is filled. Then try it like this:
df1 = df[filtered_region]
df1

Is that your exact code? And you're running it in Jupyter/JupyterLab, correct?
The code you pasted, with what I'm assuming is the Kaggle avocado.csv dataset, works for me. But I'm wondering if you are trying to call df1 before assignment. If I do either of these, I get NameError: name 'df1' is not defined:
df = pd.read_csv('/Users/my_username/Downloads/avocado.csv')
df1 = df1[ df['region'] == 'Albany' ]
df1
or
df = pd.read_csv('/Users/my_username/Downloads/avocado.csv')
df1 = df[ df1['region'] == 'Albany' ]
df1
In both examples you can see how df1 is referenced before it is assigned a value.
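As a minimal sketch of the right ordering (with a small made-up frame, since I don't have the avocado file on hand), assign df1 before using it:
import pandas as pd

df = pd.DataFrame({'region': ['Albany', 'Boston']})  # hypothetical stand-in data
df1 = df[df['region'] == 'Albany']  # bind df1 first ...
print(df1)                          # ... then reference it; reversing the order raises NameError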

I used this line and it solved my problem:
from netCDF4 import Dataset
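That import resolves the name when you are working with NetCDF files. A minimal sketch, assuming a local file named example.nc (a hypothetical filename):
from netCDF4 import Dataset

ds = Dataset("example.nc", mode="r")  # open the NetCDF file read-only
print(ds.variables.keys())            # list the variables it contains
ds.close()                            # release the file handle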

For those using PyTorch:
I solved my issue by importing the Dataset class:
from torch.utils.data import Dataset
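The import matters because custom datasets subclass Dataset. A minimal sketch, assuming your samples already live in an indexable Python collection (an assumption, not part of the original question):
from torch.utils.data import Dataset

class MyDataset(Dataset):  # hypothetical example class
    def __init__(self, samples):
        self.samples = samples    # any indexable collection
    def __len__(self):
        return len(self.samples)  # required by Dataset
    def __getitem__(self, idx):
        return self.samples[idx]  # required by Dataset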

Related

How to solve this Pyspark Code Block using Regexp

I have this CSV file, but when I run my notebook the regex throws an error:
from pyspark.sql.functions import regexp_replace
path="dbfs:/FileStore/df/test.csv"
dff = spark.read.option("header", "true").option("inferSchema", "true").option('multiline', 'true').option('encoding', 'UTF-8').option("delimiter", "‡‡,‡‡").csv(path)
dff.show(truncate=False)
#dffs_headers = dff.dtypes
for i in dffs_headers:
    columnLabel = i[0]
    print(columnLabel)
    newColumnLabel = columnLabel.replace('‡‡','').replace('‡‡','')
    dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel,'^\\‡‡|\\‡‡$','')).drop(newColumnLabel)
    if columnLabel != newColumnLabel:
        dff = dff.drop(columnLabel)
dff.show(truncate=False)
As a result I am getting this.
Can anyone improve this code? It would be a great help.
Expected output is
|‡‡123456‡‡,‡‡Version2‡‡,‡‡All questions have been answered accurately and the guidance in the questionnaire was understood and followed‡‡,‡‡2010-12-16 00:01:48.020000000‡‡|
But I am getting
‡‡Id‡‡,‡‡Version‡‡,‡‡Questionnaire‡‡,‡‡Date‡‡
The second column is showing a truncated value.
You need to import a library before you can use it. Putting the line below in a cell before the regexp_replace call should fix this issue:
from pyspark.sql.functions import regexp_replace
This is the working answer:
from pyspark.sql.functions import regexp_replace
path="dbfs:/FileStore/df/test.csv"
dff = spark.read.option("header", "true").option("inferSchema", "true").option('multiline', 'true').option('encoding', 'UTF-8').option("delimiter", "‡‡,‡‡").csv(path)
dffs_headers = dff.dtypes  # must be uncommented: the loop below iterates over it
for i in dffs_headers:
    columnLabel = i[0]
    newColumnLabel = columnLabel.replace('‡‡','').replace('‡‡','')
    dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^\\‡‡|\\‡‡$', ''))
    if columnLabel != newColumnLabel:
        dff = dff.drop(columnLabel)
dff.show(truncate=False)

Pandas Dataframe display total

Here is an example dataset, found via a Google search, that is close to the datasets in my environment.
I'm trying to get output like this
import pandas as pd
import numpy as np
data = {'Product':['Box','Bottles','Pen','Markers','Bottles','Pen','Markers','Bottles','Box','Markers','Markers','Pen'],
'State':['Alaska','California','Texas','North Carolina','California','Texas','Alaska','Texas','North Carolina','Alaska','California','Texas'],
'Sales':[14,24,31,12,13,7,9,31,18,16,18,14]}
df=pd.DataFrame(data, columns=['Product','State','Sales'])
df1=df.sort_values('State')
#df1['Total']=df1.groupby('State').count()
df1['line']=df1.groupby('State').cumcount()+1
print(df1.to_string(index=False))
The commented-out line throws this error:
ValueError: Columns must be same length as key
Trying size() instead gives NaN for all rows.
I hope someone can point me in the right direction.
Thanks in advance.
I think this should work for 'Total':
df1['Total']=df1.groupby('State')['Product'].transform(lambda x: x.count())
Try this:
df = pd.DataFrame(data).sort_values("State")
grp = df.groupby("State")
df["Total"] = grp["State"].transform("size")
df["line"] = grp.cumcount() + 1

drop rows with multiple conditions based on multiple columns in python

I have a dataset (df) as below:
I want to drop rows where SKU is "abc" and Packing is "1KG" or "5KG".
I have tried using the following code:
df.drop( df[ (df['SKU'] == "abc") & (df['Packing'] == "10KG") & (df['Packing'] == "5KG") ].index, inplace=True)
I get the following error when trying the code above:
NameError Traceback (most recent call last)
<ipython-input-1-fb4743b43158> in <module>
----> 1 df.drop( df[ (df['SKU'] == "abc") & (df['Packing'] == "10KG") & (df['Packing'] == "5KG") ].index, inplace=True)
NameError: name 'df' is not defined
Any help on this will be greatly appreciated. Thanks.
I suggest trying this:
df = df.loc[~((df['SKU'] == 'abc') & (df['packing'].isin(['1KG', '5KG'])))]
The .loc applies the conditions, and the ~ operator negates them, i.e. it keeps only the rows that do NOT match.
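Here is a quick runnable sketch with a tiny made-up frame (hypothetical data, just to show the mask in action):
import pandas as pd

df = pd.DataFrame({'SKU': ['abc', 'abc', 'xyz'],
                   'packing': ['1KG', '10KG', '5KG']})
mask = (df['SKU'] == 'abc') & (df['packing'].isin(['1KG', '5KG']))
print(df.loc[~mask])  # keeps only the rows that do NOT match both conditions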

Error when filtering DataFrame with a call to loc

I am a complete Python and Pandas novice. I am following a tutorial, and so far have the following code:
import numpy as np
import pandas as pd
import plotly as pyplot
import datetime
df = pd.read_csv("GlobalLandTemperaturesByCountry.csv")
df = df.drop("AverageTemperatureUncertainty", axis=1)
df = df.rename(columns={"dt": "Date"})
df = df.rename(columns={"AverageTemperature": "AvTemp"})
df = df.dropna()
df_countries = df.groupby(["Country", "Date"]).sum().reset_index().sort_values("Date", ascending=False)
start_date = "2001-01-01"
end_date = "2002-01-01"
mask = (df_countries["Date"] > start_date) & (df_countries["Date"] <= end_date)
df_mask = df_countries.loc(mask)
When I try and run the code, I get an error on the last line, i.e. df_mask = df_countries.loc(mask), the error being:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I have already found several StackOverflow answers for this error, but none seem to match my scenario enough to help. Why am I getting this error?
In the example above, df_countries is a DataFrame and mask is a boolean condition to be applied to it.
A Series is a mutable object, meaning its value can be changed in place without reassigning the variable; because its hash value could change along with its contents, it cannot be hashed, which is what the error is complaining about.
Try square brackets, since loc is an indexer rather than a callable:
df_mask = df_countries.loc[mask]
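A minimal sketch of the bracket form, with made-up rows (hypothetical, since I don't have the temperatures CSV):
import pandas as pd

df = pd.DataFrame({'Date': ['2001-06-01', '2003-01-01'], 'AvTemp': [15.2, 9.8]})
mask = (df['Date'] > '2001-01-01') & (df['Date'] <= '2002-01-01')
print(df.loc[mask])  # square brackets select the rows where mask is True
# df.loc(mask) would call loc like a function and fail as in the question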

pandas iterrows throwing error

I am trying to do change data capture on two dataframes. The logic is to merge the two dataframes, group by one set of keys, and then run a loop over the groups having count > 1 to see which columns were updated. I am getting a strange error; any help is appreciated.
code
import pandas as pd
import numpy as np
pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print("reading wolverine xlxs")
# defining metadata
df_header = ['DisplayName','StoreLanguage','Territory','WorkType','EntryType','TitleInternalAlias',
'TitleDisplayUnlimited','LocalizationType','LicenseType','LicenseRightsDescription',
'FormatProfile','Start','End','PriceType','PriceValue','SRP','Description',
'OtherTerms','OtherInstructions','ContentID','ProductID','EncodeID','AvailID',
'Metadata', 'AltID', 'SuppressionLiftDate','SpecialPreOrderFulfillDate','ReleaseYear','ReleaseHistoryOriginal','ReleaseHistoryPhysicalHV',
'ExceptionFlag','RatingSystem','RatingValue','RatingReason','RentalDuration','WatchDuration','CaptionIncluded','CaptionExemption','Any','ContractID',
'ServiceProvider','TotalRunTime','HoldbackLanguage','HoldbackExclusionLanguage']
df_w01 = pd.read_excel("wolverine_1.xlsx", names = df_header)
df_w02 = pd.read_excel("wolverine_2.xlsx", names = df_header)
df_w01['version'] = 'OLD'
df_w02['version'] = 'NEW'
#print(df_w01)
df_m_d = pd.concat([df_w01, df_w02], ignore_index = True)
first_pass = df_m_d[df_m_d.duplicated(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType','FormatProfile'], keep=False)]
first_pass_keep_duplicate = df_m_d[df_m_d.duplicated(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType','FormatProfile'], keep='first')]
group_by_1 = first_pass.groupby(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType','FormatProfile'])
for i, rows in group_by_1.iterrows():
    print("rownumber", i)
    print(rows)
print(first_pass)
And the error I get:
AttributeError: Cannot access callable attribute 'iterrows' of 'DataFrameGroupBy' objects, try using the 'apply' method
Any help is much appreciated.
Your GroupBy object supports iteration, so instead of
for i, rows in group_by_1.iterrows():
    print("rownumber", i)
    print(rows)
you need to do something like
for name, group in group_by_1:
    print(name)
    print(group)
then you can do what you need to do with each group.
See the docs.
Why not do as the error message suggests and use apply? Something like:
def print_rows(rows):
    print(rows)
group_by_1.apply(print_rows)
