How to search values in a dictionary in Python

I have a big CSV file with the following format:
CSV FILE 1
id, person, city
1, John, NY
2, Lucy, Miami
3, Smith, Los Angeles
4, Mike, Chicago
5, David, Los Angeles
6, Daniel, NY
On another CSV file I have each city with a numerical code:
CSV FILE 2
city , code
NY , 100
Miami, 101
Los Angeles, 102
Chicago, 103
What I need to do is go through CSV File 1 in the city column, read the name of the city and get the numerical code for that city from CSV File 2. I could then just output that list of city codes to a text file. For this example I would get this result:
100
101
102
103
102
100
I used csv.DictReader to create dictionaries for each file but I am stuck trying to find a way to map each city to each code.
Any ideas or pointers in the right direction would be appreciated!

You have some extra whitespace there, and unlike some storage formats, CSV does care about it. If that is actually in your source data, you may have to strip it out before it will be processed as you expect (otherwise various fields will have leading and trailing whitespace).
Assuming that the whitespace is gone, however, it's fairly straightforward to do. You can just create a dictionary mapping names to codes, based on the contents of your second file.
from csv import DictReader

# Build a mapping of city name -> code from the second file.
city_codes = {}
for row in DictReader(open('file2.csv', 'rb')):
    city_codes[row['city']] = row['code']

# Look up each city from the first file in that mapping.
for row in DictReader(open('file1.csv', 'rb')):
    print city_codes[row['city']]
Naturally, you can send this out to a text file as you wish, simply by redirecting the output of print as you usually would.
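If you would rather write the file from within the script instead of redirecting, a minimal sketch (the output filename is a placeholder):
with open('codes.txt', 'w') as out:
    for row in DictReader(open('file1.csv', 'rb')):
        out.write(city_codes[row['city']] + '\n')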

In addition to what Jeremy suggested, you could use the string method .strip() to remove the trailing and leading whitespace automatically.
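A minimal sketch of that, building on the lookup code above (assuming the header row itself is clean; if not, the fieldnames would need the same treatment):
from csv import DictReader

# Same lookup as above, but stripping stray leading/trailing
# whitespace from both the city names and the codes.
city_codes = {}
for row in DictReader(open('file2.csv', 'rb')):
    city_codes[row['city'].strip()] = row['code'].strip()

for row in DictReader(open('file1.csv', 'rb')):
    print city_codes[row['city'].strip()]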

Consider using sqlite3. You can then do efficient, simple and powerful joins.
If the files are really big, you can benefit from creating a proper index.
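A minimal sketch of the sqlite3 route, assuming the whitespace issues above are already handled, using an in-memory database and placeholder filenames:
import csv
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE people (id, person, city)')
conn.execute('CREATE TABLE codes (city, code)')

with open('file1.csv', 'rb') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    conn.executemany('INSERT INTO people VALUES (?, ?, ?)', reader)
with open('file2.csv', 'rb') as f:
    reader = csv.reader(f)
    next(reader)
    conn.executemany('INSERT INTO codes VALUES (?, ?)', reader)

# An index on the join column helps when the tables are large.
conn.execute('CREATE INDEX idx_city ON codes (city)')
for (code,) in conn.execute(
        'SELECT codes.code FROM people JOIN codes ON people.city = codes.city'):
    print code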

Related

pandas row manipulation - If startswith keyword found - append row to end of previous row

I have a question regarding text file handling. My text file prints as one column. The column has data scattered throughout the rows and visually looks fairly uniform, but it is still just one column. Ultimately, I'd like to append the row where the keyword is found to the end of the previous row until the data is one long row. Then I'll use str.split() to cut sections into columns as I need.
In Excel (code below) I took this same text file, removed the headers, aligned left, and searched for keywords. When found, Excel has a nice feature called offset that lets you place or append the cell value basically anywhere using offset(x,y).value from the active-cell start position. Once done, I would delete the row. This allowed me to get the data into a tabular column format that I could work with.
What I Need:
The Python code below cycles down through each row looking for the keyword 'Address:'. This part of the code works. Once it finds the keyword, the next step should append that row to the end of the previous row. This is where my problem is: I cannot find a way to get the active row number into a variable so I can use it in place of the word [index] for the active row, or [index-1] for the previous row.
Excel Code of similar task
Do
    Set Rng = WorkRng.Find("Address", LookIn:=xlValues)
    If Not Rng Is Nothing Then
        Rng.Offset(-1, 2).Value = Rng.Value
        Rng.Value = ""
    End If
Loop While Not Rng Is Nothing
Python Equivalent
import pandas as pd

file = {'Test': ['Last Name: Nobody', 'First Name: Tommy',
                 'Address: 1234 West Juniper St.',
                 'Fav Toy', 'Notes', 'Time Slot']}
df = pd.DataFrame(file)
Test
0 Last Name: Nobody
1 First Name: Tommy
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I've tried the following:
for line in df.Test:
    if line.startswith('Address:'):
        df.loc[[index-1], :].values = df.loc[index-1].values + ' ' + df.loc[index].values
        # The line above does not work with an index statement.
    else:
        pass

# df.loc[[1],:] = df.loc[1].values + ' ' + df.loc[2].values  # copies row 2 onto the end of row 1;
#                                                              works with static row numbers only
# df.drop([2,0], inplace=True)  # deletes the row from df
Expected output:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I am trying to wrap my head around the series vectorization approach, but I'm still stuck with the loops that I'm semi-familiar with. If there is a way to achieve this, please point me in the right direction.
As always, I appreciate your time and your knowledge. Please let me know if you can help with this issue.
Thank You,
Use Series.shift on Test so that each row is aligned with the text of the row after it, then use Series.str.startswith to create a boolean mask, and finally use boolean indexing with this mask to update the values in the Test column:
s = df['Test'].shift(-1)
m = s.str.startswith('Address', na=False)
df.loc[m, 'Test'] += (' ' + s[m])
Result:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot

Converting unordered list of tuples to pandas DataFrame

I am using the library usaddress to parse addresses from a set of files I have. I would like my final output to be a data frame where column names represent parts of the address (e.g. street, city, state) and rows represent each individual address I've extracted. For example:
Suppose I have a list of addresses:
addr = ['123 Pennsylvania Ave NW Washington DC 20008',
        '652 Polk St San Francisco, CA 94102',
        '3711 Travis St #800 Houston, TX 77002']
and I extract them using usaddress
info = [usaddress.parse(loc) for loc in addr]
"info" is a list of a list of tuples that looks like this:
[[('123', 'AddressNumber'),
('Pennsylvania', 'StreetName'),
('Ave', 'StreetNamePostType'),
('NW', 'StreetNamePostDirectional'),
('Washington', 'PlaceName'),
('DC', 'StateName'),
('20008', 'ZipCode')],
[('652', 'AddressNumber'),
('Polk', 'StreetName'),
('St', 'StreetNamePostType'),
('San', 'PlaceName'),
('Francisco,', 'PlaceName'),
('CA', 'StateName'),
('94102', 'ZipCode')],
[('3711', 'AddressNumber'),
('Travis', 'StreetName'),
('St', 'StreetNamePostType'),
('#', 'OccupancyIdentifier'),
('800', 'OccupancyIdentifier'),
('Houston,', 'PlaceName'),
('TX', 'StateName'),
('77002', 'ZipCode')]]
I would like each list (there are 3 lists within the object "info") to represent a row, the second value of each tuple to denote a column, and the first value of the tuple to be the cell value. Note: the length of the inner lists will not always be the same, as not every address will have every bit of information.
Any help would be much appreciated!
Thanks
I'm not sure there is a DataFrame constructor that can handle info exactly as you have it now (maybe from_records or from_items? Still, I don't think this structure would be directly compatible).
Here's a bit of manipulation to get what you're looking for:
cols = [j for _, j in info[0]]

# Could use a nested list comprehension here, but this is
# probably more readable.
info2 = []
for row in info:
    info2.append([i for i, _ in row])

pd.DataFrame(info2, columns=cols)
AddressNumber StreetName StreetNamePostType StreetNamePostDirectional PlaceName StateName ZipCode
0 123 Pennsylvania Ave NW Washington DC 20008
1 652 Polk St San Francisco, CA 94102
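Since not every address yields the same tags (the asker's note above), a hedged alternative, not part of the original answer, is to build one dict per address so pandas aligns columns by name and fills gaps with NaN:
import pandas as pd

# One dict per address, mapping tag -> value. pandas aligns
# columns by key and fills missing tags with NaN. Caveat: if a
# tag repeats (e.g. two OccupancyIdentifier tokens), only the
# last value survives; repeated values would need joining first.
rows = [{tag: value for value, tag in row} for row in info]
df = pd.DataFrame(rows)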
Thank you for your responses! I ended up doing a completely different workaround as follows:
I checked the documentation to see all possible parse_tags from usaddress, created a DataFrame with all possible tags as columns, and one other column with the extracted addresses. Then I proceeded to parse and extract information from the columns using regex. Code below!
import re
import pandas as pd
import usaddress

parse_tags = ['Recipient', 'AddressNumber', 'AddressNumberPrefix', 'AddressNumberSuffix',
              'StreetName', 'StreetNamePreDirectional', 'StreetNamePreModifier', 'StreetNamePreType',
              'StreetNamePostDirectional', 'StreetNamePostModifier', 'StreetNamePostType', 'CornerOf',
              'IntersectionSeparator', 'LandmarkName', 'USPSBoxGroupID', 'USPSBoxGroupType', 'USPSBoxID',
              'USPSBoxType', 'BuildingName', 'OccupancyType', 'OccupancyIdentifier', 'SubaddressIdentifier',
              'SubaddressType', 'PlaceName', 'StateName', 'ZipCode']

addr = ['123 Pennsylvania Ave NW Washington DC 20008',
        '652 Polk St San Francisco, CA 94102',
        '3711 Travis St #800 Houston, TX 77002']

df = pd.DataFrame({'Addresses': addr})
df = pd.concat([df, pd.DataFrame(columns=parse_tags)])  # concat returns a new frame
Then I created a new column that made a string out of the usaddress parse list and called it "Info"
df['Info'] = df['Addresses'].apply(lambda x: str(usaddress.parse(x)))
Now here's the major workaround. I looped through each column name and looked for it in the corresponding "Info" cell and applied regular expressions to extract information where they existed!
for colname in parse_tags:
    df[colname] = df['Info'].apply(
        lambda x: re.findall(r"\('(\S+)', '{}'\)".format(colname), x)[0]
        if re.search(colname, x) else "")
This is probably not the most efficient way, but it worked for my purposes. Thanks everyone for providing suggestions!

Trying to join a .dat file with a .asc file using Python or Excel

Hi guys,
I've got a bit of a unique issue trying to merge two big data files together. Both files have a column of the same data (patent number) with all other columns different.
The idea is to join them such that these patent number columns align so the other data is readable and connected.
Just the first few lines of the .dat file looks like:
IL 1 Chicago 10030271 0 3930271
PA 1 Bedford 10156902 0 3930272
MO 1 St. Louis 10112031 0 3930273
IL 1 Chicago 10030276 0 3930276
And the .asc:
02 US corporation No change 11151713 TRANSCO PROD INC 58419
02 US corporation No change 11151720 SECURE TELECOM INC 502530
02 US corporation No change 11151725 SOA SYSTEMS INC 520365
02 US corporation No change 11151738 REVTEK INC 473150
The .dat file is too large to fully open in Excel, so I don't think reorganizing it there is an option (or at least I haven't yet found a macro online that can handle it).
Quite a newbie question I feel but does anyone know how I could link these data sets together (preferably using Python) with this patent number unique identifier?
You will want to write a program that reads in the data from the two files you would like to merge: open each file and parse it line by line. From there you can write the data out to a new file in whatever order you like. This is achievable through Python file I/O.
pseudo code:
def filehandler(filename1, filename2):
    fd1 = open(filename1, "r")
    fd2 = open(filename2, "r")
    while True:
        line1 = fd1.readline()
        if not line1:
            break  # exit the loop when there is nothing more to read
        # Each line of the first file is split into a list of
        # whitespace-delimited fields.
        line1_array = line1.split()
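If pandas is an option, a merge on the shared patent-number column is a more direct route. A sketch under stated assumptions; the separator and column positions are guesses you would need to adjust to the actual files:
import pandas as pd

# Assumption: the columns are tab-separated. Fields such as
# 'St. Louis' contain spaces, so splitting on plain whitespace
# would misalign them; pd.read_fwf may fit better if the files
# are fixed-width.
dat = pd.read_csv('file.dat', sep='\t', header=None)
asc = pd.read_csv('file.asc', sep='\t', header=None)

# Assumption: column 3 of the .dat file and column 4 of the
# .asc file hold the patent number -- adjust to your data.
merged = dat.merge(asc, left_on=3, right_on=4, how='inner')
merged.to_csv('merged.csv', index=False)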

Pandas not recognizing csv columns

I am using pandas to read .csv data files. For one of my files I am able to index using the column title. For the other I get error messages:
File "/usr/lib/python2.7/dist-packages/pandas/core/internals.py",
line 1023, in _check_have
raise KeyError('no item named %s' % com.pprint_thing(item))
KeyError: u'no item named State'
The code I used is:
filename = "PovertyEstimates.csv"
#filename = "nm.csv"
f = open(filename)
import pandas as pd
data = pd.read_csv(f)#, index_col=0)
print data['State']
Even when I use index_col I get the same error (unless it is 0). I have found that when I print the csv file that isn't working in my terminal, it is not separated into columns like the one that is; rather, the items in each row are printed consecutively, separated by spaces. I believe this incorrect separation is the problem.
I am using LibreOffice Calc on Ubuntu Linux. For the improperly formatted file (which appears in perfect format in LibreOffice) the terminal output is:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3194 entries, 0 to 3193
Data columns:
FIPStxt State Area_name Rural-urban_Continuum Code_2003 Urban_Influence_Code_2003 Rural-urban_Continuum Code_20013 Urban_Influence_Code_20013 POVALL_2011 CI90LBAll_2011 CI90UBALL_2011 PCTPOVALL_2011 CI90LBALLP_2011 CI90UBALLP_2011 POV017_2011 CI90LB017_2011 CI90UB017_2011 PCTPOV017_2011 CI90LB017P_2011 CI90UB017P_2011 POV517_2011 CI90LB517_2011 CI90UB517_2011 PCTPOV517_2011 CI90LB517P_2011 CI90UB517P_2011 MEDHHINC_2011 CI90LBINC_2011 CI90UBINC_2011 POV05_2011 CI90LB05_2011 CI90UB05_2011 PCTPOV05_2011 CI90LB05P_2011 CI90UB05P_2011 3194 non-null values
dtypes: object(1)
The first few lines of the csv file are:
FIPStxt State Area_name Rural-urban_Continuum Code_2003
01000 AL Alabama
01001 AL Autauga County 2 2
01003 AL Baldwin County 4 5
The spaces are probably the problem. You need to tell pandas what separator to use when parsing the CSV.
data = pd.read_csv(f, sep=" ")
The problem is, though, that it will pick up all spaces as valid separators (e.g. Alabama County becomes 2 columns). The best approach would be to convert that one file to an actual comma- (semicolon- or other-) separated file, or to make sure that compound values are quoted ("Alabama County") and then specify the quotechar:
data = pd.read_csv(f, sep=" ", quotechar='"')
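If the file is actually aligned with runs of spaces rather than single ones (an assumption about the file, not something verified above), a regular-expression separator is another option; regex separators require the python engine:
# Treat two or more consecutive spaces as the delimiter, so that
# single spaces inside values like 'Autauga County' survive.
data = pd.read_csv(f, sep=r"\s{2,}", engine="python")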

CSV-like data in script to Pandas DataFrame

I've got a list of cities with associated lon,lat values that I'd like to turn into a DataFrame, but instead of reading from a CSV file, I want to have the user modify or add to these city,lat,lon values into a cell in an IPython notebook. Right now I have this solution that works, but it seems a bit ugly:
import numpy as np
import pandas as pd

sta = np.array([
    ('Boston', 42.368186, -71.047984),
    ('Provincetown', 42.042745, -70.171180),
    ('Sandwich', 41.767990, -70.466219),
    ('Gloucester', 42.610253, -70.660570)
], dtype=[('City', '|S20'), ('Lat', '<f4'), ('Lon', '<f4')])

# Create a pandas DataFrame from the structured array
obs = pd.DataFrame.from_records(sta, index='City')
print(obs)
Lat Lon
City
Boston 42.368187 -71.047981
Provincetown 42.042744 -70.171181
Sandwich 41.767990 -70.466217
Gloucester 42.610252 -70.660568
Is there a clearer, safer way to create the DataFrame?
I'm thinking that folks will forget the parentheses, add a trailing ',' on the last line, etc.
Thanks,
Rich
You could just create a big multiline string that they edit, then use read_csv to read it from a StringIO object:
x = """
City, Lat, Long
Boston, 42.4, -71.05
Provincetown, 42.04, -70.12
"""
>>> pandas.read_csv(StringIO.StringIO(x.strip()), sep=",\s*")
City Lat Long
0 Boston 42.40 -71.05
1 Provincetown 42.04 -70.12
Of course, people can still make errors with this (e.g., inserting commas), but the format is simpler.
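Another hedged alternative, not from the original answer: let users edit a plain list of tuples, which sidesteps the structured-array dtype string entirely:
import pandas as pd

# A plain list of (city, lat, lon) tuples is easy to edit by hand.
sta = [
    ('Boston', 42.368186, -71.047984),
    ('Provincetown', 42.042745, -70.171180),
]
obs = pd.DataFrame(sta, columns=['City', 'Lat', 'Lon']).set_index('City')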
