Python optimizations for a basic CSV file

I have a pseudo-CSV file (pipe-separated rather than comma-separated). There are two columns: the first column's header is Location and has no relevance to the problem; the second column is an identifier (in this case, username). The file looks something like this:
Location | Username
San Francisco, CA | sam001040
Chicago, IL | tinytom
New York City, NY | coder23
Palo Alto, CA | sam001040
As you can see, sam001040 is seen in two cities (San Francisco & Palo Alto).
I need to assign a unique identification number to each username and create a new, similarly formatted table that includes the new ID number. The mappings (username -> ID) should be stored to disk, because in a few days I might need to process another file and want to reuse the previously stored mappings.
So after the ID process, the file should look like:
Location | Username | UniqueID
San Francisco, CA | sam001040 | 0
Chicago, IL | tinytom | 1
New York City, NY | coder23 | 2
Palo Alto, CA | sam001040 | 0
A few days later, a file like this can come in:
Location | Username
Grand Rapids, MI | gowolves
Chicago, IL | ill
Los Angeles, CA | trojans
Castro Valley, CA | coder23
Since there are some new usernames, new identifiers need to be created, and one username (coder23) was seen last time. So the new output file should look like this:
Location | Username | UniqueID
Grand Rapids, MI | gowolves | 3
Chicago, IL | ill | 4
Los Angeles, CA | trojans | 5
Castro Valley, CA | coder23 | 2
Here is a link to the code; there are some comments, and hopefully the names are helpful, but I can clarify anything.
A couple of caveats:
The file I am manipulating is 1.3 GB, approximately 20,000,000 rows, with about 30% duplication in usernames (translating to about 14,000,000 keys in the dictionary).
Currently I only have access to my local machine (MacBook Pro, 8 GB RAM, 512 GB flash storage).
Additional Info / What I've tried so far
Initially I used for loops in Python, then realized that's not good practice, so I switched over to pandas dataframes and used lambdas accordingly.
I was writing to another file, then decided to print to the console and redirect to another file (using >).
I tried to process the file as a whole, which always caused something to break and once used up 500 GB of memory (I don't know how that happened).
I broke the large 1.3 GB file into 50 smaller ones; each one takes ~3 hours to process.
I tried pickling before, then switched to JSON to store the dictionary, after reading Pickle vs. Json (link in comments).
I ran a profiler (SnakeViz) and here are the results. From my understanding, it seems like checking the dictionary for keys is taking up the time, but after reading another Stack Overflow post, "in" is generally the fastest check (Most efficient method to check if dictionary key exists and process its value if it does).
Main Question -
Am I doing something completely wrong? I've spent the entire week looking at this and am not sure what more to do. I didn't think it would take on the order of ~150 hours to process everything.
If anyone has any suggestions or different ideas, please let me know! This is my first post, so if I need to include more info (or remove some), I apologize in advance and will adjust the post accordingly.

In general, when checking whether a key is in a dictionary, do k in d, not k in d.items(), which is dramatically slower: k in d is a single hash lookup, while k in d.items() scans every (key, value) pair and is therefore linear in the size of the dictionary. E.g.
In [68]: d = {x:x+1 for x in range(100000)}
In [69]: %timeit (67 in d)
10000000 loops, best of 3: 39.2 ns per loop
In [70]: %timeit (67 in d.items())
100 loops, best of 3: 10.8 ms per loop
That alone would make a big difference. But, I would use a pattern more like this, which should speed things up some more. .map looks up the id for existing users, and the .unique() gets the set of new usernames (filtering to those not matched in the lookup table).
df['UserId'] = df['Username'].map(segment_dict)
new_users = df[pd.isnull(df['UserId'])]['Username'].unique()
for u in new_users:
    segment_dict[u] = unique_ids
    unique_ids += 1
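To tie the pieces together, here is a minimal sketch of one way to apply this pattern to the full 1.3 GB file, streaming it in chunks and persisting the mapping as JSON between runs. The file names (input.txt, output.txt, username_ids.json) and the chunk size are illustrative assumptions, not from the original post:

import json
import os
import pandas as pd

MAPPING_PATH = 'username_ids.json'  # hypothetical location for the persisted mapping

# Load the previously stored mapping, if any
if os.path.exists(MAPPING_PATH):
    with open(MAPPING_PATH) as f:
        segment_dict = json.load(f)
else:
    segment_dict = {}
next_id = max(segment_dict.values(), default=-1) + 1

# Stream the large file in chunks so it never has to fit in memory at once
reader = pd.read_csv('input.txt', sep='|', skipinitialspace=True, chunksize=1_000_000)
wrote_header = False
for chunk in reader:
    chunk.columns = [c.strip() for c in chunk.columns]
    chunk['Username'] = chunk['Username'].str.strip()

    # Vectorized lookup; usernames missing from the mapping come back as NaN
    chunk['UniqueID'] = chunk['Username'].map(segment_dict)
    for u in chunk.loc[chunk['UniqueID'].isnull(), 'Username'].unique():
        segment_dict[u] = next_id
        next_id += 1
    # A second map fills in the IDs that were just assigned
    chunk['UniqueID'] = chunk['Username'].map(segment_dict).astype(int)

    chunk.to_csv('output.txt', sep='|', index=False,
                 mode='a' if wrote_header else 'w', header=not wrote_header)
    wrote_header = True

# Persist the mapping for the next run
with open(MAPPING_PATH, 'w') as f:
    json.dump(segment_dict, f)

Because every per-row operation here is either a vectorized pandas call or a dict hash lookup, a pass over the file should be I/O-bound rather than taking hours per slice.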

You could try keeping the User -> ID mapping in CSV for use in pandas.
Assuming you have a CSV file mapping known usernames to IDs:
$ cat ids.csv
sam001040,0
tinytom,1
coder23,2
And a new file newfile.txt that you need to process:
$ cat newfile.txt
Location | Username
Grand Rapids, MI | gowolves
Chicago, IL | ill
Los Angeles, CA | trojans
Castro Valley, CA | coder23
You read in ids.csv:
import numpy as np
import pandas as pd

ids = pd.read_csv('ids.csv', header=None, index_col=0, names=['Username', 'ID'])
and newfile.txt:
newfile = pd.read_csv('newfile.txt', sep=' \| ', skipinitialspace=True)
# or pd.read_csv('newfile.txt', sep='|'), which is faster but won't work nicely
# when the file has spaces as shown above
Now you can do:
newfile_with_ids = newfile.merge(ids, left_on='Username', right_index=True, how='left')
All known IDs are already filled in:
Location Username ID
0 Grand Rapids, MI gowolves NaN
1 Chicago, IL ill NaN
2 Los Angeles, CA trojans NaN
3 Castro Valley, CA coder23 2
Now, add new IDs:
mask = newfile_with_ids['ID'].isnull()
new_names = newfile_with_ids.loc[mask, 'Username'].drop_duplicates()
ids = pd.concat([ids, pd.DataFrame(
    data={'ID': 1 + int(ids['ID'].iloc[-1]) + np.arange(len(new_names))},
    index=new_names)])
to get:
ID
Username
sam001040 0
tinytom 1
coder23 2
gowolves 3
ill 4
trojans 5
Then write new IDs to the dataframe:
newfile_with_ids.loc[mask, 'ID'] = ids.loc[
    newfile_with_ids.loc[mask, 'Username'], 'ID'].values
And finally you have:
Location Username ID
3 Castro Valley, CA coder23 2
0 Grand Rapids, MI gowolves 3
1 Chicago, IL ill 4
2 Los Angeles, CA trojans 5
Finally, save the new ids back and continue.
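For example, a one-line sketch that writes the updated mapping back in the same two-column format as the ids.csv above:

ids.to_csv('ids.csv', header=False)

Since Username is the index of ids, this produces exactly the username,ID rows the next run will read back in.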

Related

How to write a Function in python pandas to append the rows in dataframe in a loop?

I am being provided with a data set and am writing a function.
My objective is quite simple: I have an Airbnb database with various columns. I am using a for loop over a neighbourhood group list (that I created) and am trying to extract (append) the data related to each element into an empty dataframe.
Example:
import pandas as pd
import numpy as np
dict1 = {'id' : [2539,2595,3647,3831,12937,18198,258838,258876,267535,385824],'name':['Clean & quiet apt home by the park','Skylit Midtown Castle','THE VILLAGE OF HARLEM....NEW YORK !','Cozy Entire Floor of Brownstone','1 Stop fr. Manhattan! Private Suite,Landmark Block','Little King of Queens','Oceanview,close to Manhattan','Affordable rooms,all transportation','Home Away From Home-Room in Bronx','New York City- Riverdale Modern two bedrooms unit'],'price':[149,225,150,89,130,70,250,50,50,120],'neighbourhood_group':['Brooklyn','Manhattan','Manhattan','Brooklyn','Queens','Queens','Staten Island','Staten Island','Bronx','Bronx']}
df = pd.DataFrame(dict1)
df
I created a function as follows
nbd_grp = ['Bronx','Queens','Staten Islands','Brooklyn','Manhattan']

# Creating a function to find the cheapest place in each neighbourhood group
dfdf = pd.DataFrame(columns=['id','name','price','neighbourhood_group'])

def cheapest_place(neighbourhood_group):
    for elem in nbd_grp:
        data = df.loc[df['neighbourhood_group']==elem]
        cheapest = data.loc[data['price']==min(data['price'])]
        dfdf = cheapest.copy()

cheapest_place(nbd_grp)
My expected output is:

id     | name                                | price | neighbourhood_group
267535 | Home Away From Home-Room in Bronx   | 50    | Bronx
18198  | Little King of Queens               | 70    | Queens
258876 | Affordable rooms,all transportation | 50    | Staten Island
3831   | Cozy Entire Floor of Brownstone     | 89    | Brooklyn
3647   | THE VILLAGE OF HARLEM....NEW YORK ! | 150   | Manhattan
My advice is that anytime you are working in a database or in a dataframe and you think "I need to loop", you should think again.
When in a dataframe you are in a world of set-based logic and there is likely a better set-based way of solving the problem. In your case you can groupby() your neighbourhood_group and get the min() of the price column and then merge or join that result set back to your original dataframe to get your id and name columns.
That would look something like:
df_min_price = df.groupby('neighbourhood_group').price.agg(min).reset_index().merge(df, on=['neighbourhood_group','price'])
+-----+---------------------+-------+--------+-------------------------------------+
| idx | neighbourhood_group | price | id | name |
+-----+---------------------+-------+--------+-------------------------------------+
| 0 | Bronx | 50 | 267535 | Home Away From Home-Room in Bronx |
| 1 | Brooklyn | 89 | 3831 | Cozy Entire Floor of Brownstone |
| 2 | Manhattan | 150 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! |
| 3 | Queens | 70 | 18198 | Little King of Queens |
| 4 | Staten Island | 50 | 258876 | Affordable rooms,all transportation |
+-----+---------------------+-------+--------+-------------------------------------+
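As a variant on the same set-based idea: if you only want a single row per group even when prices tie, a sketch using idxmin (which keeps the first cheapest listing in each group) avoids the merge entirely:

df_min_price = df.loc[df.groupby('neighbourhood_group')['price'].idxmin()]

The groupby/min/merge version above keeps all tied listings; the idxmin version keeps exactly one per group.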

How to Insert several separate characters in a sqlite3 cell?

I want to insert several different values in just one cell.
E.g.
Friends' names
ID | Grade | Names
----+--------------+----------------------------
1 | elementary | Kai, Matthew, Grace
2 | guidance | Eli, Zoey, David, Nora, William
3 | High school | Emma, James, Levi, Sophia
Or as a list or dictionary:
ID | Grade | Names
----+--------------+------------------------------
1 | elementary | [Kai, Matthew, Grace]
2 | guidance | [Eli, Zoey, David, Nora, William]
3 | High school | [Emma, James, Levi, Sophia]
or
ID | Grade | Names
----+--------------+---------------------------------------------
1 | elementary | { a:Kai, b:Matthew, c:Grace}
2 | guidance | { a:Eli, b:Zoey, c:David, d:Nora, e:William}
3 | High school | { a:Emma, b:James, c:Levi, d:Sophia}
Is there a way?
Yes, there is a way, but that doesn't mean you should do it this way.
You could, for example, serialize your values as a JSON string and save that inside the column. If you later want to add a value, you can simply parse the JSON, add the value, and put it back into the database. (It might also work with a BLOB, but I'm not sure.)
However, I would not recommend saving a list inside a column, as SQL is not meant to be used like that.
What I would recommend is a table for the grades, each with its own primary key, like this:
ID | Grade
---+------------
1  | Elementary
2  | Guidance
3  | High school
And then another table containing all the names, with its own primary key and the GradeID as a foreign key. E.g.:
ID | GradeID | Name
---+---------+---------
1  | 1       | Kai
2  | 1       | Matthew
3  | 1       | Grace
4  | 2       | Eli
5  | 2       | Zoey
6  | 2       | David
7  | 2       | Nora
8  | 2       | William
9  | 3       | Emma
10 | 3       | James
11 | 3       | Levi
12 | 3       | Sophia
If you want to know more about this, you should read about Normalization in SQL.
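For illustration, here is a minimal sketch of that two-table layout using Python's built-in sqlite3 module; the table and column names are my own, not from the question:

import sqlite3

conn = sqlite3.connect('school.db')  # hypothetical database file
cur = conn.cursor()

# One row per grade, with its own primary key
cur.execute("""CREATE TABLE IF NOT EXISTS grades (
    id INTEGER PRIMARY KEY,
    grade TEXT NOT NULL)""")

# One row per name, referencing the grade it belongs to
cur.execute("""CREATE TABLE IF NOT EXISTS names (
    id INTEGER PRIMARY KEY,
    grade_id INTEGER NOT NULL REFERENCES grades(id),
    name TEXT NOT NULL)""")

cur.execute("INSERT INTO grades (grade) VALUES (?)", ('elementary',))
grade_id = cur.lastrowid
cur.executemany("INSERT INTO names (grade_id, name) VALUES (?, ?)",
                [(grade_id, n) for n in ('Kai', 'Matthew', 'Grace')])
conn.commit()

# Reassemble the comma-separated view from the question with GROUP_CONCAT
for row in cur.execute("""SELECT g.grade, GROUP_CONCAT(n.name, ', ')
                          FROM grades g JOIN names n ON n.grade_id = g.id
                          GROUP BY g.id"""):
    print(row)  # ('elementary', 'Kai, Matthew, Grace')

conn.close()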

Split lists with uncertain elements into different categories (using pandas)

I am having trouble with a pandas split. So I have a column of data that looks something like this:
Initial Dataframe
index | Address
0 | [123 New York St]
1 | [Amazing Building, 23 New Jersey St, 2F]
2 | [98 New Mexico Ave, 16F]
3 | [White House, 1600 Pennsylvania Ave, PH]
4 | [221 Baker Street]
5 | [Hogwarts]
As you can see, the lists contain varying categories and numbers of elements. Some have building names along with addresses; some only have addresses with building floors. I want to sort them out by category (building name, address, unit/floor number), but I'm having trouble coming up with a solution, as I'm a beginner Python & pandas learner.
How do I split the addresses into different categories to get an output like the one below, assuming the building names ALL start with a letter and I can put Null for categories with missing values?
Desired Output:
index | Building Name | Address | Unit Number
0 | Null | 123 New York St | Null
1 | Amazing Building | 23 New Jersey St. | 2F
2 | Null | 98 New Mexico Ave. | 16F
3 | White House | 1600 Pennsylvania Ave | PH
4 | Null | 221B Baker St | Null
5 | Hogwarts | Null | Null
The main thing I need is for all addresses to be in the Address Column. Thanks for any help!
Precondition: the building name starts with a letter, not a number.
If a building name starts with a number, this can produce the wrong result.
import pandas as pd

df = pd.DataFrame({'addr' : ['123 New York St',
                             'Amazing Building, 23 New Jersey St, 2F',
                             '98 New Mexico Ave, 16F']})

# Check the number of items in the address value
df['addr'] = df['addr'].str.split(',')
df['cnt'] = df['addr'].apply(lambda x: len(x)).values

# Function: check whether the value starts with a digit
def CheckInt(s):
    try:
        int(s[0])
        return True
    except ValueError:
        return False

for i, v in df.iterrows():
    # One item in the address value
    if v.cnt == 1:
        df.loc[i, 'Address'] = v.addr[0]
    # Three items in the address value
    elif v.cnt == 3:
        df.loc[i, 'Building'] = v.addr[0]
        df.loc[i, 'Address'] = v.addr[1]
        df.loc[i, 'Unit'] = v.addr[2]
    # Two items in the address value
    else:
        if CheckInt(v.addr[0]):
            df.loc[i, 'Address'] = v.addr[0]
            df.loc[i, 'Unit'] = v.addr[1]
        else:
            df.loc[i, 'Building'] = v.addr[0]
            df.loc[i, 'Address'] = v.addr[1]
We can get the output for your input dataframe as below.
If the data is different, you may have to tinker around.
import numpy as np

df['com_Address'] = df['Address'].apply(lambda x: x.replace('[','').replace(']','')).str.split(',')
st_list = ['St','Ave']
df['St_Address'] = df.apply(lambda x: [a if st in a else '' for st in st_list for a in x['com_Address']], axis=1)
df['St_Address'] = df['St_Address'].apply(lambda x: [i for i in x if i]).astype(str).apply(lambda x: x.strip("[]'"))
df['Building Name'] = df.apply(lambda x: [x['com_Address'][0] if len(x['com_Address'])==3 else 'Null'], axis=1).astype(str).apply(lambda x: x.strip("[]'"))
df['Building Name'] = df.apply(lambda x: np.where((len(x['com_Address'])==1) & (x['St_Address']==''), x['com_Address'][0], x['Building Name']), axis=1)
df['Unit Number'] = df.apply(lambda x: [x['com_Address'][2] if len(x['com_Address'])==3 else 'Null'], axis=1).astype(str).apply(lambda x: x.strip("[]'"))
df['Unit Number'] = df.apply(lambda x: np.where((len(x['com_Address'])==2) & (x['St_Address']!=''), x['com_Address'][-1], x['Unit Number']), axis=1)
df
Column "com_Address" is optional. I had to create it because the 'Address' from your input came to me as a string & not as a list. If you already have it as list, you don't need this & you will have to update "com_Address" with 'Address' in the code.
Output
index Address com_Address Building Name St_Address Unit Number
0 0 [123 New York St] [ 123 New York St] Null 123 New York St Null
1 1 [Amazing Building, 23 New Jersey St, 2F] [ Amazing Building, 23 New Jersey St, 2F] Amazing Building 23 New Jersey St 2F
2 2 [98 New Mexico Ave, 16F] [ 98 New Mexico Ave, 16F] Null 98 New Mexico Ave 16F
3 3 [White House, 1600 Pennsylvania Ave, PH] [ White House, 1600 Pennsylvania Ave, PH] White House 1600 Pennsylvania Ave PH
4 4 [221 Baker Street] [ 221 Baker Street] Null 221 Baker Street Null
5 5 [Hogwarts] [ Hogwarts] Hogwarts Null

How do I check if a column value is in a dictionary, if it is replace another column value with the dictionary value? [duplicate]

This question already has answers here:
Remap values in pandas column with a dict, preserve NaNs
(11 answers)
Closed 3 years ago.
I have the following Data Frame:
Company ID Company Name State
100 Apple CA
100 Apl CA
200 Amazon WA
200 Amz WA
300 Oracle CA
300 Oracle CA
And the following dictionary:
{100: "Apple, Inc.", 200: "Amazon, Inc."}
*Note that Oracle isn't in the dictionary
What I want to do is replace the values in the Company Name column with the dictionary value if the key is present in the Company ID column, otherwise leave the value the same. So the output would be:
Company ID Company Name State
100 Apple, Inc. CA
100 Apple, Inc. CA
200 Amazon, Inc. WA
200 Amazon, Inc. WA
300 Oracle CA
300 Oracle CA
I am looking to do something similar to .replace; however, instead of replacing a value within the same column, I would like to replace the value in another column. I'm having trouble because the keys and values are in different columns. Any help would be very appreciated!
Use map with .fillna:
df['Company Name'] = df['Company ID'].map(dct).fillna(df['Company Name'])
Company ID Company Name State
0 100 Apple, Inc. CA
1 100 Apple, Inc. CA
2 200 Amazon, Inc. WA
3 200 Amazon, Inc. WA
4 300 Oracle CA
5 300 Oracle CA
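One caveat worth knowing: .map matches on exact values, so the dictionary keys must have the same type as the Company ID column. A quick hypothetical sketch of the failure mode:

df['Company ID'].map({'100': 'Apple, Inc.'})  # str keys vs int column: all NaN
df['Company ID'].map({100: 'Apple, Inc.'})    # int keys match as intended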

Pandas read_html returned column with NaN values in Python

I am trying to parse the table located here using the pandas read_html function. I was able to parse the table; however, the Capacity column came back with NaN values, and I am not sure what the reason could be. I would like to parse the entire table and use it for further research, so any help is appreciated. Below is my code so far:
wiki_url='Above url'
df1=pd.read_html(wiki_url,index_col=0)
Try something like this (specifying flavor='bs4'):
df = pd.read_html(r'https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')
df = df[0]
print(df.head())
Image Stadium City State \
0 NaN Aggie Memorial Stadium Las Cruces NM
1 NaN Alamodome San Antonio TX
2 NaN Alaska Airlines Field at Husky Stadium Seattle WA
3 NaN Albertsons Stadium Boise ID
4 NaN Allen E. Paulson Stadium Statesboro GA
Team Conference Capacity \
0 New Mexico State Independent 30,343[1]
1 UTSA C-USA 65000
2 Washington Pac-12 70,500[2]
3 Boise State Mountain West 36,387[3]
4 Georgia Southern Sun Belt 25000
.............................
.............................
To strip anything inside square brackets, use (passing regex=True explicitly, which recent pandas versions require):
df.Capacity = df.Capacity.str.replace(r"\[.*\]", "", regex=True)
print(df.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Hope this helps.
Pandas is only able to get the superscript (for whatever reason) rather than the actual value; if you print all of df1 and check the Capacity column, you will see that some of the values are [1], [2], etc. (where the cells have footnotes) and NaN otherwise.
You may want to look into alternative ways of fetching the data, or scrape it yourself using BeautifulSoup, since pandas is finding, and therefore returning, the wrong data.
The answer posted by #anky_91 was correct. I wanted to try another approach without using regex; below is my solution.
df4=pd.read_html('https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')
df4 = df4[0]
The change was to take out the "r" (raw-string prefix) used by #anky_91 in line 1 and line 4.
print(df4.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Name: Capacity, dtype: object
