Masking the Zip Codes - python

I'm taking a course and I need to solve the following assignment:
"In this part, you should write a for loop, updating the df_users dataframe.
Go through each user, and update their zip code, to Safe Harbor specifications:
If the user is from a zip code for which the “Geographic Subdivision” is less than or equal to 20,000, change the zip code in df_users to ‘0’ (as a string)
Otherwise, zip should be only the first 3 numbers of the full zip code
Do all this by directly updating the zip column of the df_users DataFrame
Hints:
This will be several lines of code, looping through the DataFrame, getting each zip code, checking the geographic subdivision with the population in zip_dict, and setting the zip_code accordingly.
Be very aware of your variable types when working with zip codes here."
Here you can find all the data necessary to understand the context:
https://raw.githubusercontent.com/DataScienceInPractice/Data/master/
assignment: 'A4'
data_files: user_dat.csv, zip_pop.csv
After cleaning the data from user_dat.csv leaving only the columns: 'age', 'zip' and 'gender', and creating a dictionary from zip_pop.csv that contains the population of the first 3 digits from all the zipcodes; I wrote this code:
# Loop through the dataframe's to get each zipcode
for zipcode in df_users['zip']:
    # check if the zipcode's 3 first numbers from the dataframe, correspond to a population of more or less than 20.000 people
    if zip_dict[zipcode[:len(zipcode) - 2]] <= 20000:
        # if less, change zipcode value to string zero.
        df_users.loc[df_users['zip'] == zipcode, 'zip'] = '0'
    else:
        # If more, preserve only the first 3 digits of the zipcode.
        df_users.loc[df_users['zip'] == zipcode, 'zip'] = zipcode[:len(zipcode) - 2]
This code works halfway and I don't understand why.
It changes the zipcode to 0 if the population is less than 20.000 people, and also changes the first zipcodes (up until the ones that start with '078') but then it returns this error message:
KeyError                                  Traceback (most recent call last)
/var/folders/95/4vh4zhc1273fgmfs4wyntxn00000gn/T/ipykernel_44758/1429192050.py in <module>
      1 for zipcode in df_users['zip']:
----> 2     if zip_dict[zipcode[:len(zipcode) - 2]] <= 20000:
      3         df_users.loc[df_users['zip'] == zipcode, 'zip'] = '0'
      4     else:
      5         df_users.loc[df_users['zip'] == zipcode, 'zip'] = str(zipcode[:len(zipcode) - 2])
KeyError: '0'
I get that the problem is in the last line of code, because I've been running the code one line at a time and each line worked until I added that last one. And if I just print the zipcodes instead of that last line, it also works!
Can anyone help me understand why my code is wrong?

You're modifying a collection of values (i.e. df_users['zip']) whilst you're iterating over it. This is a common anti-pattern. Rows that have already been shortened can be visited again later in the loop: for example, a value that is already '078' gets sliced a second time to '0', which is not a key in zip_dict, and that would explain KeyError: '0'. If a loop is absolutely required, then you could consider iterating over df_users['zip'].unique() instead. That creates a copy of all the unique zip codes, solving your current error, and it means that you aren't redoing work when you encounter a duplicate zipcode.
If a loop is not required, then there are better (more pandas-style) ways to go about your problem. I would suggest something like (untested):
zip_start = df_users['zip'].str[:-2]
df_users['zip'] = zip_start.where(zip_start.map(zip_dict) > 20000, other="0")
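To make the two suggestions above concrete, here is a minimal sketch on made-up data. The zip codes and populations are hypothetical stand-ins for the assignment's df_users and zip_dict, and it assumes every zip is a five-character string. It runs the loop over unique() and the vectorized where() side by side so the results can be compared.

```python
import pandas as pd

# Hypothetical toy data; the real values come from user_dat.csv and zip_pop.csv.
raw_zips = ["07801", "07802", "03601", "90210"]
zip_dict = {"078": 150000, "036": 12000, "902": 500000}

# Loop version: iterate over a snapshot of the unique original zip codes,
# so values that were already shortened are never sliced a second time.
df_users = pd.DataFrame({"zip": raw_zips})
for zipcode in df_users["zip"].unique():
    prefix = zipcode[:3]
    new_value = "0" if zip_dict[prefix] <= 20000 else prefix
    df_users.loc[df_users["zip"] == zipcode, "zip"] = new_value

# Vectorized version: keep the prefix where the population is large enough,
# otherwise fall back to the string "0".
zip_start = pd.Series(raw_zips).str[:3]
vectorized = zip_start.where(zip_start.map(zip_dict) > 20000, other="0")

print(df_users["zip"].tolist())
print(vectorized.tolist())
```

Both approaches should give the same column; the vectorized one simply avoids the Python-level loop.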

Related

How can I use Python and Pandas to parse through text and return the strings I want in separate data cells?

So I have compiled a list of NFL game projections from the 2020 season for fantasy relevant players. Each row contains the team names, score, relevant players and their stats like in the text below. The problem is that each of the player names and stats are either different lengths or written out in slightly different ways.
`Bears 24-17 Jaguars
M.Trubisky- 234/2TDs
D.Montgomery- 113 scrim yards/1 rush TD/4 rec
A.Robinson- 9/114/1
C.Kmet- 3/35/0
G.Minshew- 183/1TD/2int
J.Robinson- 77 scrim yards/1 rush TD/4 rec
DJ.Chark- 3/36`
I'm trying to create a data frame that will split the player name, receptions, yards, and touchdowns into separate columns. Then I will be able to compare these numbers to their actual game numbers and see how close the predictions were. Does anyone have an idea for a solution in Python? Even if you could point me in the right direction I'd greatly appreciate it!
You can split the full string using the '-' (dash/minus sign) as the separator. Then use indexing to get the different parts.
Using str.split(sep='-')[0] gives you the name. Here, str would be the row, for example M.Trubisky- 234/2TDs.
Similarly, str.split(sep='-')[1] gives you everything but the name.
As for splitting anything after the name, there is no way of doing it unless they are in a certain order. If you are able to somehow achieve this, there is a way of splitting into columns.
I am going to assume that the trend here is yards / touchdowns / receptions, in which case, we can again use the str.split() method. I am also assuming that the 'rows' only belong to one team. You might have to run this script once for each team to create a dataframe, and then join all dataframes with a new feature called 'team_name'.
You can define lists and append values to them, and then use the lists to create a dataframe. This snippet should help you.
import re
import pandas as pd

# rows is assumed to be a list of player lines such as 'M.Trubisky- 234/2TDs'
names, scrim_yards, touchdowns, receptions = [], [], [], []
for row in rows:
    # Split on the first '-' only --> sample name: M.Trubisky, sample stats: ' 234/2TDs'
    name, stats = row.split(sep='-', maxsplit=1)
    names.append(name)
    # Since we only want the 'numbers' from each stat, we can filter out what we want using regular expressions.
    numerical_stats = re.findall(r'\d+', stats) # sample stats: ['234', '2']
    # Pad with None so rows with fewer than three stats do not raise an IndexError
    numerical_stats += [None] * (3 - len(numerical_stats))
    # now we use indexing again to get desired values
    scrim_yards.append(numerical_stats[0])
    touchdowns.append(numerical_stats[1])
    receptions.append(numerical_stats[2])
# You can then create a pandas dataframe
nfl_player_stats = pd.DataFrame({'names': names, 'scrim_yards': scrim_yards, 'touchdowns': touchdowns, 'receptions': receptions})
As you are pointing out, often times the hardest part of processing a data file like this is handling all the variability and inconsistency in the file itself. There are a lot of things that can vary inside the file, and then sometimes the file also contains silly errors (typos, missing whitespace, and the like). Depending on the size of the data file, you might be better off simply hand-editing it to make it easier to read into Python!
If you tackle this directly with Python code, then it's a very good idea to be very careful to verify the actual data matches your expectations of it. Here are some general concepts on how to handle this:
First off, make sure to strip every line of whitespace and ignore blank lines:
for curr_line in file_lines:
    curr_line = curr_line.strip()
    if len(curr_line) > 0:
        # Process the line...
        pass
Once you have your stripped, non-blank line, make sure to handle the "game" (matchup between two teams) line differently than the lines denoting players:
TEAM_NAMES = [ "Cardinals", "Falcons", "Panthers", "Bears", "Cowboys", "Lions",
               "Packers", "Rams", "Vikings" ] # and 23 more; you get the idea
#...down in the code where we are processing the lines...
if any([tn in curr_line for tn in TEAM_NAMES]):
    # ...handle as a "matchup"
    pass
else:
    # ...handle as a "player"
    pass
When handling a player and their stats, we can use "- " as a separator. (You must include the space, otherwise players such as Clyde Edwards-Helaire will split the line in a way you did not want.) Here we unpack into exactly two variables, which gives us a nice error check since the code will raise an exception if the line doesn't split into exactly two parts.
p_name, p_stats = curr_line.split("- ")
Handling the stats will be the hardest part. It will all depend on what assumptions you can safely make about your input data. I would recommend being very paranoid about validating that the input data agrees with the assumptions in your code. Here is one notional idea: an over-engineered solution, but one that should help to manage the hassle of finding all the little issues that are probably lurking in that data file:
if "scrim yards" in p_stats:
    # This is a running back, so "scrim yards" then "rush TD" then "rec"
    rb_stats = p_stats.split("/")
    # To get the number, just split by whitespace and grab the first one
    scrim_yds = int(rb_stats[0].split()[0])
    if len(rb_stats) >= 2:
        rush_tds = int(rb_stats[1].split()[0])
    if len(rb_stats) >= 3:
        rec = int(rb_stats[2].split()[0])
    # Always check for unexpected data...
    if len(rb_stats) > 3:
        raise Exception("Excess data found in rb_stats: {}".format(rb_stats))
elif "TD" in p_stats:
    # This is a quarterback, so "yards"/"TD"/"int"
    qb_stats = p_stats.split("/")
    qb_yards = int(qb_stats[0]) # Or store directly into the DF; you get the idea
    # Handle "TD" or "TDs". Personal preference is to avoid regexp's
    if len(qb_stats) >= 2:
        if qb_stats[1].endswith("TD"):
            qb_td = int(qb_stats[1][:-2])
        elif qb_stats[1].endswith("TDs"):
            qb_td = int(qb_stats[1][:-3])
        else:
            raise Exception("Unknown qb_stats: {}".format(qb_stats))
    # Handle "int" if it's there
    if len(qb_stats) >= 3:
        if qb_stats[2].endswith("int"):
            qb_int = int(qb_stats[2][:-3])
        else:
            raise Exception("Unknown qb_stats: {}".format(qb_stats))
    # Always check for unexpected data...
    if len(qb_stats) > 3:
        raise Exception("Excess data found in qb_stats: {}".format(qb_stats))
else:
    # Must be a receiver: receptions/yards/TD
    rcv_rec, rcv_yds, rcv_td = p_stats.split("/")
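As a smaller, runnable starting point for the ideas above, here is a sketch that only separates the name from the numbers and leaves the meaning of each number to later code. The helper names leading_int and parse_player are made up for this illustration, and it assumes every stat starts with its number, as in the sample rows from the question.

```python
def leading_int(text):
    """Return the integer formed by the leading digits of text, e.g. '2TDs' -> 2."""
    digits = ""
    for ch in text.strip():
        if not ch.isdigit():
            break
        digits += ch
    if not digits:
        raise ValueError("No leading number in {!r}".format(text))
    return int(digits)


def parse_player(line):
    """Split 'Name- a/b/c' into the name and a list of integers."""
    # Unpacking into exactly two names raises if the line has an unexpected shape.
    name, stats = line.strip().split("- ")
    return name, [leading_int(part) for part in stats.split("/")]


sample_lines = [
    "M.Trubisky- 234/2TDs",
    "D.Montgomery- 113 scrim yards/1 rush TD/4 rec",
    "A.Robinson- 9/114/1",
    "DJ.Chark- 3/36",
]
parsed = [parse_player(line) for line in sample_lines]
for name, numbers in parsed:
    print(name, numbers)
```

Keeping the numbers as a plain list means rows with a missing stat, like the last one, still parse; deciding which position means yards, touchdowns or receptions can then be done per player type.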

Unable to change value of dataframe at specific location

So I'm trying to go through my dataframe in pandas and, if the values of two columns are equal to something, change a value at that location. Here is a simplified version of the loop I've been using (I changed the conditions of the if/else because the original used regex and was quite complicated):
pro_cr = ["IgA", "IgG", "IgE"] # CR's considered productive
rows_changed = 0
prod_to_unk = 0
unk_to_prod = 0
changed_ids = []
for index in df_sample.index:
    if num == 1 and color == "red":
        pass
    elif num == 2 and color == "blue":
        prod_to_unk += 1
        changed_ids.append(df_sample.loc[index, "Sequence ID"])
        df_sample.at[index, "Functionality"] = "unknown"
        rows_changed += 1
    elif num == 3 and color == "green":
        unk_to_prod += 1
        changed_ids.append(df_sample.loc[index, "Sequence ID"])
        df_sample.at[index, "Functionality"] = "productive"
        rows_changed += 1
    else:
        pass
print("Number of productive columns changed to unknown: {}".format(prod_to_unk))
print("Number of unknown columns changed to productive: {}".format(unk_to_prod))
print("Total number of rows changed: {}".format(rows_changed))
So the main problem is the changing code:
df_sample.at[index, "Functionality"] = "unknown" # or productive
If I run this code without these lines of code, it works properly, it finds all the correct locations, tells me how many were changed and what their ID's are, which I can use to validate with the CSV file.
If I use df_sample["Functionality"][index] = "unknown" # or productive the code runs, but checking the rows that have been changed shows that they were not changed at all.
When I use df.at[row, column] = value I get "AttributeError: 'BlockManager' object has no attribute 'T'"
I have no idea why this is showing up. There are no duplicate columns. Hope this was clear (if not let me know and I'll try to clarify it). Thanks!
To be honest, I've never used df.at - but try using df.loc instead:
df_sample.loc[index, "Functionality"] = "unknown"
You can also use iat, which takes integer positions rather than labels.
Example: df.iat[i, j] for the i-th row and j-th column.
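Here is a small sketch of the loc approach on invented data. The num and color columns are hypothetical stand-ins for whatever the real conditions test. It uses boolean masks instead of a row loop, which is one possible way to do it rather than the only one, and it reads a single cell back with iat.

```python
import pandas as pd

# Hypothetical sample frame; only the column names from the question are real.
df_sample = pd.DataFrame({
    "Sequence ID": ["s1", "s2", "s3", "s4"],
    "num": [1, 2, 3, 2],
    "color": ["red", "blue", "green", "red"],
    "Functionality": ["productive", "productive", "unknown", "productive"],
})

# One boolean mask per rule; & combines element-wise conditions.
to_unknown = (df_sample["num"] == 2) & (df_sample["color"] == "blue")
to_productive = (df_sample["num"] == 3) & (df_sample["color"] == "green")

# Record which rows are about to change, then assign through .loc.
changed_ids = df_sample.loc[to_unknown | to_productive, "Sequence ID"].tolist()
df_sample.loc[to_unknown, "Functionality"] = "unknown"
df_sample.loc[to_productive, "Functionality"] = "productive"

# iat needs integer positions, so look up the column position first.
col_pos = df_sample.columns.get_loc("Functionality")
second_value = df_sample.iat[1, col_pos]

print(changed_ids, int(to_unknown.sum()), int(to_productive.sum()), second_value)
```

Assigning through a single .loc call avoids the chained indexing in df_sample["Functionality"][index] = ..., which can end up writing to a temporary copy.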

Iterating over a csv file given a specific range

So the problem I'm having is that I'm iterating over a pretty large csv file. startDate and endDate are input given to me by the user and I need to only search in that range.
However, when I run the program up to that point, it takes a long time and then just prints "set()". I've marked where I'm having trouble in the code.
I'm looking for suggestions and possibly sample code. Thank you all in advance!
def compare(word1, word2, startDate, endDate):
    with open('all_words.csv') as allWords:
        readWords = csv.reader(allWords, delimiter=',')
        year = set()
        for row in readWords:
            if row[1] in range(int(startDate), int(endDate)): #< Having trouble here
                if row[0] == word1:
                    year.add(row[1])
    print(year)
The reason your test isn't finding any years is that the expression:
row[1] in range(int(startDate), int(endDate))
is checking to see if a string value appears in a list of integers. If you test:
"1970" in range(1960, 1980)
you will see that it returns False. You need to write:
int(row[1]) in range(int(startDate), int(endDate))
However, this is still quite inefficient. It is checking if the value int(row[1]) occurs anywhere in the sequence [int(startDate), int(startDate)+1, ..., int(endDate)], and it's doing it by linear search. Much faster will be:
if int(startDate) <= int(row[1]) < int(endDate):
Note that your code above was written to exclude endDate from the list of possible dates (because range excludes its second argument), and I've done the same above.
Edit: Actually, I guess I should point out that it's only Python 2 where an expression like 500000 in range(1, 1000000) is inefficient. In Python 3 (or in Python 2 with xrange in place of range), it's fast.
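A self-contained sketch of the fix, using an in-memory string in place of all_words.csv. The sample rows and the function name years_for_word are invented for illustration; it assumes the word is in the first column and the year in the second, as in the question.

```python
import csv
import io

# Hypothetical stand-in for all_words.csv: word, year
SAMPLE = "apple,1965\napple,1985\npear,1970\napple,1972\n"


def years_for_word(csv_file, word, start_year, end_year):
    """Collect the years in [start_year, end_year) in which word appears."""
    # Convert the user input once, outside the loop.
    start, end = int(start_year), int(end_year)
    years = set()
    for row in csv.reader(csv_file, delimiter=','):
        year = int(row[1])  # csv gives strings, so convert before comparing
        if row[0] == word and start <= year < end:
            years.add(year)
    return years


# A string is never "in" a range of integers, which is why the set stayed empty.
string_in_range = "1970" in range(1960, 1980)
found = years_for_word(io.StringIO(SAMPLE), "apple", "1960", "1980")
print(string_in_range, found)
```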
You can try the read_csv function of the pandas library. Its chunksize parameter lets you read a fixed number of rows at a time, so you can work around the size problem.
import pandas as pd
# file_name and chunk_size stand for your CSV path and the number of rows per chunk
for df in pd.read_csv(file_name, chunksize=chunk_size):
    # select data rows which have desired dates
    pass
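For a version you can run as-is, here is a sketch that filters each chunk by word and year. It assumes the CSV has a header row with columns named word and year, which the question does not show, and it reads from an in-memory string rather than a real file.

```python
import io

import pandas as pd

# Hypothetical data with an assumed header row.
SAMPLE = "word,year\napple,1965\napple,1985\npear,1970\napple,1972\n"


def years_in_range(csv_source, word, start_year, end_year, chunk_size=2):
    """Read the CSV in chunks and collect matching years in [start_year, end_year)."""
    years = set()
    for chunk in pd.read_csv(csv_source, chunksize=chunk_size):
        mask = (
            (chunk["word"] == word)
            & (chunk["year"] >= start_year)
            & (chunk["year"] < end_year)
        )
        years.update(chunk.loc[mask, "year"].tolist())
    return years


result = years_in_range(io.StringIO(SAMPLE), "apple", 1960, 1980)
print(result)
```

Only one chunk is held in memory at a time, so chunk_size can be tuned to the size of the file.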

appending array breaks program

I am writing a program to analyze some of our invoice data. Basically, I need to take an array containing each individual invoice we sent out over the past year and break it down into twelve arrays, each containing the invoices for one month, using the dateSeperate() function, so that monthly_transactions[0] returns January's transactions, monthly_transactions[1] returns February's, and so forth.
I've managed to get it working so that dateSeperate returns monthly_transactions[0] as the January transactions. However, once all of the January data is entered, I attempt to append to the monthly_transactions array using line 44 (the last else statement in dateSeperate). This just causes the program to break and become unresponsive. The code still executes and doesn't return an error, but Python becomes unresponsive and I have to force quit out of it.
I've been writing to the global array monthly_transactions. dateSeperate runs fine as long as I don't include the last else statement. In that case, monthly_transactions[0] returns an array containing all of the January invoices. The issue arises in my last else statement, which, when added, causes Python to freeze.
Can anyone help me shed any light on this?
I have written a program that defines all of the arrays I'm going to be using (yes, I know global arrays aren't good; I'm a marketer trying to learn programming, so any input you could give me on how to improve this would be much appreciated).
import csv
line_items = []
monthly_transactions = []
accounts_seperated = []
Then I import all of my data and place it into the line_items array
def csv_dict_reader(file_obj):
    global board_info
    reader = csv.DictReader(file_obj, delimiter=',')
    for line in reader:
        item = []
        item.append(line["company id"])
        item.append(line["user id"])
        item.append(line["Amount"])
        item.append(line["Transaction Date"])
        item.append(line["FIrst Transaction"])
        line_items.append(item)
if __name__ == "__main__":
    with open("ChurnTest.csv") as f_obj:
        csv_dict_reader(f_obj)
#formats the transacation date data to make it more readable
def dateFormat():
    for i in range(len(line_items)):
        ddmmyyyy =(line_items[i][3])
        yyyymmdd = ddmmyyyy[6:] + "-"+ ddmmyyyy[:2] + "-" + ddmmyyyy[3:5]
        line_items[i][3] = yyyymmdd
#Takes the line_items array and splits it into new array monthly_tranactions, where each value holds one month of data
def dateSeperate():
    for i in range(len(line_items)):
        #if there are no values in the monthly transactions, add the first line item
        if len(monthly_transactions) == 0:
            test = []
            test.append(line_items[i])
            monthly_transactions.append(test)
        # check to see if the line items year & month match a value already in the monthly_transaction array.
        else:
            for j in range(len(monthly_transactions)):
                line_year = line_items[i][3][:2]
                line_month = line_items[i][3][3:5]
                array_year = monthly_transactions[j][0][3][:2]
                array_month = monthly_transactions[j][0][3][3:5]
                #print(line_year, array_year, line_month, array_month)
                #If it does, add that line item to that month
                if line_year == array_year and line_month == array_month:
                    monthly_transactions[j].append(line_items[i])
                #Otherwise, create a new sub array for that month
                else:
                    monthly_transactions.append(line_items[i])
dateFormat()
dateSeperate()
print(monthly_transactions)
I would really, really appreciate any thoughts or feedback you guys could give me on this code.
Based on the comments on the OP, your csv_dict_reader function seems to do exactly what you want it to do, at least inasmuch as it appends data from its argument csv file to the top-level variable line_items. You said yourself that if you print out line_items, it shows the data that you want.
"But appending doesn't work." I take it you mean that appending the line_items to monthly_transactions isn't being done. The reason for that is that you didn't tell the program to do it! The appending that you're talking about is done as part of your dateSeperate function; however, you still need to call the function.
I'm not sure exactly how you want to use your dateFormat and dateSeperate functions, but in order to use them, you need to include them in the main function somehow as calls, i.e. dateFormat() and dateSeperate().
EDIT: The last else: section is the problem. It appends to monthly_transactions every time the line/array year and month aren't equal, and it does so inside the loop for j in range(len(monthly_transactions)):. Because range(len(...)) is evaluated once when that loop starts, each inner loop does finish, but a single line item can add one entry for every non-matching entry already in the list. The list can therefore roughly double with each line item, which for a year of invoices looks like an endless loop.
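One way to avoid growing the list inside the loop that scans it is to group by a year-month key in a dictionary. This is only a sketch: the sample rows are invented, the function name separate_by_month is hypothetical, and it assumes the dates are already in yyyy-mm-dd form.

```python
from collections import defaultdict

# Hypothetical line items: [company id, user id, amount, transaction date, first transaction]
line_items = [
    ["c1", "u1", "10.00", "2016-01-05", "yes"],
    ["c2", "u2", "25.00", "2016-01-20", "no"],
    ["c1", "u1", "10.00", "2016-02-05", "no"],
]


def separate_by_month(items):
    """Return a list of lists, one per month, in chronological order."""
    by_month = defaultdict(list)
    for item in items:
        year_month = item[3][:7]  # 'yyyy-mm' prefix of the date string
        by_month[year_month].append(item)
    # Sorting the 'yyyy-mm' keys as strings also sorts them by date.
    return [by_month[key] for key in sorted(by_month)]


monthly_transactions = separate_by_month(line_items)
print(len(monthly_transactions), [len(month) for month in monthly_transactions])
```

Each line item is looked at exactly once, so the work grows linearly with the number of invoices, and the function takes and returns its data instead of relying on globals.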

Comparing list items and tuples

In Python 3, I'm trying to create a program which takes input from a user as 3-letter codes and converts them into items in a list. It then compares these items with the first part (the 3-letter code) of each tuple in a list of tuples and prints the whole tuple.
import shares
portfolio_str=input("Please list portfolio: ")
portfolio_str= portfolio_str.replace(' ','')
portfolio_str= portfolio_str.upper()
portfolio_list= portfolio_str.split(',')
print(portfolio_list)
print()
print('{:<6} {:<20} {:>8}'.format('Code', 'Name', 'Price'))
data=shares.EXCHANGE_DATA
for (code, name, share_value) in data:
    if code == i in portfolio_list:
        print('{:<6} {:<20} {:>8.2f}'.format(code, name, share_value))
    else:
        print("Failure")
As you can see I'm using a module called shares containing a list of tuples called EXCHANGE_DATA which is set out like this:
EXCHANGE_DATA = [('AIA', 'Auckair', 1.50),
('AIR', 'Airnz', 5.60),
('AMP', 'Amp',3.22),
('ANZ', 'Anzbankgrp', 26.25),
('ARG', 'Argosy', 12.22),
('CEN', 'Contact', 11.22),
('CNU', 'Chorus',3.01),
('DIL', 'Diligent', 5.3),
('DNZ', 'Dnz Property', 2.33),
('EBO', 'Ebos', 1.1),
An exemplar input would be:
AIA, AMP, ANZ
The corresponding output would be:
Code   Name                    Price
AIA    Auckair                  1.50
AMP    Amp                      3.22
ANZ    Anzbankgrp              26.25
I'm just stuck on the for and/or if statements which I think I need.
Your issue is this here:
if code == i in portfolio_list:
This doesn't do what you expect in Python. Comparison operators chain, so the expression is evaluated as code == i and i in portfolio_list, and since i is never defined, it raises a NameError. in checks whether a given value is contained in the list, so what you want is simply:
if code in portfolio_list:
Note that if portfolio_list could be long, it might be worth making it a set, as checking for membership in a set is significantly more efficient for large amounts of data.
Your syntax appears to be a mashup of different methodologies. You might have meant:
if any(code == i for i in portfolio_list):
However, as this is directly equivalent to code in portfolio_list, but more verbose and inefficient, it's not a good solution.
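Putting the membership test together with the set suggestion, a runnable sketch might look like this. It inlines a shortened copy of EXCHANGE_DATA instead of importing the shares module, and the helper names parse_portfolio and portfolio_rows are invented for the example.

```python
# Shortened copy of the data from the question, inlined so the example runs on its own.
EXCHANGE_DATA = [
    ('AIA', 'Auckair', 1.50),
    ('AIR', 'Airnz', 5.60),
    ('AMP', 'Amp', 3.22),
    ('ANZ', 'Anzbankgrp', 26.25),
]


def parse_portfolio(portfolio_str):
    """Turn 'aia, AMP,anz' into the set {'AIA', 'AMP', 'ANZ'}."""
    return {code.strip().upper() for code in portfolio_str.split(',') if code.strip()}


def portfolio_rows(portfolio_str, data):
    """Return the tuples whose code is in the portfolio, in exchange order."""
    wanted = parse_portfolio(portfolio_str)
    return [row for row in data if row[0] in wanted]


rows = portfolio_rows("AIA, AMP, ANZ", EXCHANGE_DATA)
print('{:<6} {:<20} {:>8}'.format('Code', 'Name', 'Price'))
for code, name, share_value in rows:
    print('{:<6} {:<20} {:>8.2f}'.format(code, name, share_value))
```

Filtering first and printing afterwards also avoids printing "Failure" once for every exchange entry that is not in the portfolio.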
