Spark - length of element of row - python

I am trying to filter for all the rows where the length of my variable country is less than 4, and I keep getting errors no matter what I do.
This is the current code (using the Python API)
uniqueRegions = sqlContext.sql("SELECT country, city FROM df")
uniqueRegions = uniqueRegions.rdd
uniqueRegions = uniqueRegions.distinct()
uniqueRegions = uniqueRegions.filter(lambda line: len(line.country) < 4)
This is the error
TypeError: object of type 'NoneType' has no len()
And the first row (done with rdd.first):
Row(country=u'xxxxxx', city=u'xxxxxx')
Any suggestion on how to solve this?
Thanks.

You have a database record where the country is NULL, and the length of that doesn't make sense. What should happen when there's no country set?
Maybe you want to filter those records out: SELECT country, city FROM df WHERE country IS NOT NULL. Or handle it in the lambda: lambda l: l.country is not None and len(l.country) < 4, or, depending on your logic, lambda l: l.country is None or len(l.country) < 4.
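A null-safe version of the original pipeline could look like this (a sketch, keeping the question's sqlContext setup and pushing the NULL check into the SQL):
# NULL countries are dropped in the SQL, so len() is always safe in the filter
uniqueRegions = sqlContext.sql("SELECT country, city FROM df WHERE country IS NOT NULL")
uniqueRegions = uniqueRegions.rdd.distinct()
uniqueRegions = uniqueRegions.filter(lambda line: len(line.country) < 4)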

Related

How to create new columns of the last 5 sale prices in a dataframe

I have a pandas data frame of sneaker sales, which looks like this:
I added columns last1, ..., last5 indicating the last 5 sale prices of the sneakers and made them all None. I'm trying to update the values of these new columns using the 'Sale Price' column. This is my attempt:
for index, row in df.iterrows():
    if (index == 0):
        continue
    for i in range(index-1, -1, -1):
        if df['Sneaker Name'][index] == df['Sneaker Name'][i]:
            df['last5'][index] = df['last4'][i]
            df['last4'][index] = df['last3'][i]
            df['last3'][index] = df['last2'][i]
            df['last2'][index] = df['last1'][i]
            df['last1'][index] = df['Sale Price'][i]
            continue
    if (index == 100):
        break
When I ran this, I got a warning,
A value is trying to be set on a copy of a slice from a DataFrame
and the result is also wrong.
Does anyone know what I did wrong?
Also, this is the expected output,
Use this instead of the for loop, if your rows are sorted:
df['last1'] = df['Sale Price'].shift(1)
df['last2'] = df['last1'].shift(1)
df['last3'] = df['last2'].shift(1)
df['last4'] = df['last3'].shift(1)
df['last5'] = df['last4'].shift(1)
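If the previous prices should only come from rows with the same sneaker (an assumption, since the question matches on 'Sneaker Name'), the same shift idea works per group:
# Shift the sale price within each sneaker's own history
g = df.groupby('Sneaker Name')['Sale Price']
for n in range(1, 6):
    df[f'last{n}'] = g.shift(n)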

PySpark RDD filter trouble with inequality

I have an RDD called bank_rdd which has been imported from a CSV file.
First I have split each line separated by a comma into a list
bank_rdd1 = bank_rdd.map(lambda line: line.split(','))
The header titles are:
accountNumber, personFname, personLname, balance
I then removed the header
header = bank_rdd1.first()
bank_rdd1 = bank_rdd1.filter(lambda row: row != header)
Sample data for the first two records as follows:
[('1','John','Smith','1100'),('2','Jane','Doe','500')]
When I run the following code I get a count of 100 (which is the amount of records before I filter)
bank_rdd1.count()
When I run the following code I get a count of 0. Note that x[3] refers to the column that contains bank account balances and it is a string.
bank_rdd1 = bank_rdd1.filter(lambda x: int(x[3]) > 1000)
bank_rdd1.count()
I'm not sure why it is returning a count of 0, when in the CSV file there are 20 rows where the bank account balance is greater than 1000.
Could anybody point out what the error may be?
The following code works just fine for me.
>>> data = spark.sparkContext.parallelize([('1','John','Smith','1100'),('2','Jane','Doe','500')])
>>> data.first()
('1', 'John', 'Smith', '1100')
>>> data.count()
2
>>> data.filter(lambda x: int(x[3]) > 1000).count()
1
Are you sure this is what's causing the error? Can you share the whole code? Can you tell us about your PySpark environment?
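Getting 0 rather than an exception means int(x[3]) succeeded for every row but no value compared greater than 1000, which usually points at the values not being what you expect, for example the wrong column index. One way to check (a debugging sketch, assuming the RDD layout shown above):
# Look at the raw balance strings before filtering
print(bank_rdd1.map(lambda x: x[3]).take(10))

# Filter defensively while investigating: skip rows whose balance doesn't parse cleanly
def parse_balance(value):
    try:
        return int(value.strip().strip('"'))
    except (ValueError, AttributeError):
        return None

high_balance = bank_rdd1.filter(lambda x: (parse_balance(x[3]) or 0) > 1000)
print(high_balance.count())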

How to return info from different column, same row with Pandas table in Python?

I'm trying to determine the names of the boroughs with the highest/lowest populations and population densities and the largest/smallest areas. This is a screenshot of my table.
I am able to print these values on their own but don't know how to determine the names of the boroughs they correspond to.
maxpop = table.population.max
minpop = table.population.min
maxden = table.density.max
minden = table.density.min
maxar = table.area.max
minar = table.area.min
for item in table:
    if int(item['population']) == int(maxpop):
        result = item['name']
print(result)
I then get a TypeError because 'string indices must be integers.' But I don't even know if this is the right way to go about it? Please help!!
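Two things trip this up: table.population.max without parentheses returns the method itself rather than the maximum value, and for item in table iterates over the column names (strings), which is where the 'string indices must be integers' error comes from. A common approach, assuming table is a pandas DataFrame with columns name, population, density, and area, is idxmax/idxmin plus .loc:
# Find the row label holding the extreme value, then look up its name
maxpop_name = table.loc[table['population'].idxmax(), 'name']
minpop_name = table.loc[table['population'].idxmin(), 'name']
maxden_name = table.loc[table['density'].idxmax(), 'name']
minden_name = table.loc[table['density'].idxmin(), 'name']
maxar_name = table.loc[table['area'].idxmax(), 'name']
minar_name = table.loc[table['area'].idxmin(), 'name']
print(maxpop_name)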

Iterate a piece of code connecting to API using two variables pulled from two lists

I'm trying to run a script (using the Google Search Console API) over a table of keywords and dates in order to check if there was improvement in keyword performance (SEO) after each date.
Since I'm really clueless I'm guessing and trying, but the Jupyter notebook isn't responding so I can't even tell if I'm wrong...
The repo I took this code from was made by Josh Carty:
https://github.com/joshcarty/google-searchconsole
I've already used pd.read_csv on the input table (it consists of two columns, 'keyword' and 'date') and made the columns into two separate lists, KW_list and Date_list (or maybe it's better to use a dictionary or something else?).
I tried:
for i in KW_list and j in Date_list:
    account = searchconsole.authenticate(client_config='client_secrets.json',
                                         credentials='credentials.json')
    webproperty = account['https://www.example.com/']
    report = webproperty.query.range(j, days=-30).filter('query', i, 'contains').get()
    report2 = webproperty.query.range(j, days=30).filter('query', i, 'contains').get()
    df = pd.DataFrame(report)
    df2 = pd.DataFrame(report2)
df
I expect to see a data frame of all the different keywords (keyword1 - stats1, keyword2 - stats2 below it, etc., with no overwriting) for the dates 30 days before the date in the neighbouring cell of the input file,
or at least some response from the Jupyter notebook so I know what is going on.
Try using the zip function to combine the lists into a list of tuples, so that each date is paired with its corresponding keyword.
account = searchconsole.authenticate(client_config='client_secrets.json', credentials='credentials.json')
webproperty = account['https://www.example.com/']

df1 = None
df2 = None
first = True
for (keyword, date) in zip(KW_list, Date_list):
    report = webproperty.query.range(date, days=-30).filter('query', keyword, 'contains').get()
    report2 = webproperty.query.range(date, days=30).filter('query', keyword, 'contains').get()
    if first:
        df1 = pd.DataFrame(report)
        df2 = pd.DataFrame(report2)
        first = False
    else:
        df1 = df1.append(pd.DataFrame(report))
        df2 = df2.append(pd.DataFrame(report2))
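Note that DataFrame.append was removed in pandas 2.0; on a newer pandas, collect the frames in a list and concatenate once at the end (same loop otherwise):
# Build up lists of per-keyword frames, then concatenate once
frames1, frames2 = [], []
for keyword, date in zip(KW_list, Date_list):
    report = webproperty.query.range(date, days=-30).filter('query', keyword, 'contains').get()
    report2 = webproperty.query.range(date, days=30).filter('query', keyword, 'contains').get()
    frames1.append(pd.DataFrame(report))
    frames2.append(pd.DataFrame(report2))
df1 = pd.concat(frames1, ignore_index=True)
df2 = pd.concat(frames2, ignore_index=True)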

pandas - extract values greater than a threshold from a column

I have a DataFrame - a snapshot of which looks like this:
I am trying to grab all the math_score and reading_score values greater than 70 grouped by school_name.
So my end result should look something like this:
I am trying to calculate the % of students with a passing math_score and reading_score, i.e. the % of scores > 70.
Any help on how I can go about this?
This is what I have tried:
school_data_grouped = school_data_complete.groupby('school_name')
passing_math_score = school_data_grouped.loc[(school_data_grouped['math_score'] >= 70)]
I get an error with this that says:
AttributeError: Cannot access callable attribute 'loc' of 'DataFrameGroupBy' objects, try using the 'apply' method
What can I do to achieve this? Any help is much appreciated.
Thanks!
You can create a column for whether each student passed, for example:
school_data['passed_math'] = school_data['math_score'] >= 70
school_data['passed_both'] = (school_data['math_score'] >= 70) & (school_data['reading_score'] >= 70)
You can then get the pass rate by school using a groupby:
pass_rate = school_data.groupby('school_name').mean()
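Since passed_math and passed_both are booleans, their mean is the passing fraction; multiply by 100 for a percentage. On newer pandas versions you may want to select just those columns (or pass numeric_only=True) so non-numeric columns don't break the mean:
# Fraction of passing students per school, expressed as a percentage
pass_rate = school_data.groupby('school_name')[['passed_math', 'passed_both']].mean() * 100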
You need to filter on math_score and reading_score first and then apply groupby, because groupby doesn't return a DataFrame.
To work on your question, I got data from this link
DATA
https://www.kaggle.com/aljarah/xAPI-Edu-Data/
I changed column names though.
CODE
import pandas as pd
school_data_df = pd.read_csv('xAPI-Edu-Data 2.csv')
school_data_df.head()
df_70_math_score = school_data_df[school_data_df.math_score > 70]
df_70_reading_math_score = df_70_math_score[df_70_math_score.reading_score > 70]
df_70_reading_math_score.head()
grouped_grade = df_70_reading_math_score.groupby('GradeID')
You can do any stats generation from this groupby_object 'grouped_grade'
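For example, to turn that groupby into a percentage of students passing per grade (a sketch comparing against the unfiltered data):
# Passing students per grade divided by all students per grade
total_per_grade = school_data_df.groupby('GradeID').size()
passing_per_grade = grouped_grade.size()
percent_passing = (passing_per_grade / total_per_grade * 100).fillna(0)
print(percent_passing)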
