Unable to merge two dataframes using pandas - python

Despite multiple attempts, I am not succeeding at a simple merge operation on two dataframes. The code below returns
KeyError: 'CODE'
on the merge call.
Note 1: To make the post reproducible, StringIO is used here with only two lines within each CSV, but in real life I read from files with thousands of records.
Note 2: Notice the trailing ',' (separator) at the end of each line: my CSV files are badly formatted but this is how actual files are.
Note 3: I am using Python 2.7
from StringIO import StringIO
import pandas as pd
master = StringIO("""N-NUMBER,SERIAL NUMBER,MFR MDL CODE,ENG MFR MDL,YEAR MFR,TYPE REGISTRANT,NAME,STREET,STREET2,CITY,STATE,ZIP CODE,REGION,COUNTY,COUNTRY,LAST ACTION DATE,CERT ISSUE DATE,CERTIFICATION,TYPE AIRCRAFT,TYPE ENGINE,STATUS CODE,MODE S CODE,FRACT OWNER,AIR WORTH DATE,OTHER NAMES(1),OTHER NAMES(2),OTHER NAMES(3),OTHER NAMES(4),OTHER NAMES(5),EXPIRATION DATE,UNIQUE ID,KIT MFR, KIT MODEL,MODE S CODE HEX,
1 ,1071 ,3980115,54556,1988,5,FEDERAL AVIATION ADMINISTRATION ,WASHINGTON REAGAN NATIONAL ARPT ,3201 THOMAS AVE HANGAR 6 ,WASHINGTON ,DC,20001 ,1,001,US,20160614,19900214,1T ,5,5 ,V ,50000001, ,19880909, , , , , ,20191130,00524101, , ,A00001 ,""")
mfr = StringIO("""CODE,MFR,MODEL,TYPE-ACFT,TYPE-ENG,AC-CAT,BUILD-CERT-IND,NO-ENG,NO-SEATS,AC-WEIGHT,SPEED,
3980115,EXLINE ACE-C ,ACE-C ,4,1 ,1,1,01,001,CLASS 1,0082,""")
masterdf = pd.read_csv(master,sep=",",index_col=False)
mfrdf = pd.read_csv(mfr,sep=",",index_col=False)
masterdf.merge(mfrdataframe,left_on='MFR MDL CODE',right_on='CODE', how='inner')

I think that the problem is the name of the dataframe you're passing to merge: mfrdataframe should instead be mfrdf.
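With that fix applied, the merge goes through. For completeness, a minimal corrected call (continuing from the snippet above, where masterdf and mfrdf are already defined):

# merge on the manufacturer model code; 'mfrdf' is the name actually
# created by read_csv above
merged = masterdf.merge(mfrdf, left_on='MFR MDL CODE', right_on='CODE', how='inner')
print(merged[['N-NUMBER', 'MFR MDL CODE', 'MFR', 'MODEL']])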

How to delete icons from comments in csv files using pandas

I am trying to delete an icon which appears in many rows of my csv file. When I create a dataframe using pd.read_csv it shows a green squared check icon, but if I open the csv using Excel I see ✅ instead. I tried to delete it using the split function, since the verification status is separated from the comment by |:
df['reviews'] = df['reviews'].apply(lambda x: x.split('|')[1])
but I noticed it doesn't detect the "|" separator when the review contains the icon mentioned above.
I am not sure if it is an encoding problem. I tried adding encoding='utf-8' to pandas read_csv but it didn't solve the problem.
Thanks in advance.
I would like to add a pic of what I see when I open the csv file using Excel.
You can remove characters that are not Latin-1 encodable using the encode/decode methods:
>>> df
reviews
0 ✓ Trip Verified
1 Verified
>>> df['reviews'].str.encode('latin1', errors='ignore').str.decode('latin1')
0 Trip Verified
1 Verified
Name: reviews, dtype: object
Say you had the following dataframe:
reviews
0 ✅ Trip Verified
1 Not Verified
2 Not Verified
3 ✅ Trip Verified
You can use the replace method to replace the ✅ symbol, which is Unicode character U+2705.
df['reviews'] = df['reviews'].apply(lambda x: x.replace('\u2705',''))
Here is the full example:
Code:
import pandas as pd
df = pd.DataFrame({"reviews":['\u2705 Trip Verified', 'Not Verified', 'Not Verified', '\u2705 Trip Verified']})
df['reviews'] = df['reviews'].apply(lambda x: x.replace('\u2705',''))
print(df)
Output:
reviews
0 Trip Verified
1 Not Verified
2 Not Verified
3 Trip Verified
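If the data may contain several different icons rather than just ✅, a more general option (my own generalization, not taken from either answer) is to strip every non-ASCII character with a regex:

import pandas as pd

df = pd.DataFrame({"reviews": ['\u2705 Trip Verified', '\u2713 Trip Verified', 'Not Verified']})
# drop any character outside the ASCII range; loosen the pattern if you
# need to keep accented letters or other legitimate non-ASCII text
df['reviews'] = df['reviews'].str.replace(r'[^\x00-\x7F]+', '', regex=True).str.strip()
print(df)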

Is pandas and numpy any good for manipulation of non numeric data?

I've been going in circles for days now, and I've run out of steam. Doesn't help that I'm new to python / numpy / pandas etc.
I started with numpy which led me to pandas, because of a GIS function that delivers a numpy array of data. That is my starting point. I'm trying to get to an endpoint being a small enriched dataset, in an excel spreadsheet.
But it seems like going down a rabbit hole trying to extract that data, and then manipulate it with the numpy toolsets. The delivered data is one dimensional, but each row contains 8 fields. A simple conversion to pandas and then to ndarray, magically makes it all good. Except that I lose headers in the process, and it just snowballs from there.
I've had to re-evaluate my understanding based on some feedback on another post, and that's fine. But I'm just going in circles. Example after example seems to use predominantly numerical data, and I'm starting to get the feeling that's where its strength lies. Trying to use it for what I call a more non-mathematical / non-numerical purpose looks like barking up the wrong tree.
Any advice?
Addendum
The data I extract from the GIS system is names, dates, other textual data. I then have another csv file that I need to use as a lookup, so that I can enrich the source with more textual information which finally gets published to excel.
SAMPLE DATA - SOURCE
WorkCode Status WorkName StartDate EndDate siteType Supplier
0 AT-W34319 None Second building 2020-05-04 2020-05-31 Type A Acem 1
1 AT-W67713 None Left of the red office tower 2019-02-11 2020-08-28 Type B Quester Q
2 AT-W68713 None 12 main street 2019-05-23 2020-11-03 Class 1 Type B Dettlim Group
3 AT-W70105 None city central 2019-03-07 2021-08-06 Other Hans Int
4 AT-W73855 None top floor 2019-05-06 2020-10-28 Type a None
SAMPLE DATA - CSV
["Id", "Version","Utility/Principal","Principal Contractor Contact"]
XM-N33463,7.1,"A Contracting company", "555-12345"
XM-N33211,2.1,"Contractor #b", "555-12345"
XM-N33225,1.3,"That other contractor", "555-12345"
XM-N58755,1.0,"v Contracting", "555-12345"
XM-N58755,2.3,"dsContracting", "555-12345"
XM-222222,2.3,"dsContracting", "555-12345"
BM-O33343,2.1,"dsContracting", "555-12345"
def SMAN():
    ####################################################################################################################
    # Exporting the results of the analysis...
    ####################################################################################################################
    """
    Approach is as follows:
    1) Get the source data
    2) Get the CSV lookup data loaded into memory - it'll be faster
    3) Iterate through the source data, looking for matches in the CSV data
    4) Add an extra couple of columns onto the source data, and populate them with the (matching) lookup data.
    5) Export the now enriched data to Excel.
    """
    arcpy.env.workspace = workspace + filenameGDB
    input = "ApprovedActivityByLocalBoard"
    exportFile = arcpy.da.FeatureClassToNumPyArray(input, ['WorkCode', 'Status', 'WorkName', 'PSN2', 'StartDate', 'EndDate', 'siteType', 'Supplier'])
    # we have our data, but it's (9893,) instead of [9893 rows x 8 columns]
    pdExportFile = pandas.DataFrame(exportFile)
    LBW = pdExportFile.to_numpy()
    del exportFile
    del pdExportFile
    # Now we have [9893 rows x 8 columns] - but we've lost the headers
    col_list = ["WorkCode", "Version", "Principal", "Contact"]
    allPermits = pandas.read_csv("lookup.csv", usecols=col_list)
    # Now we have the CSV file loaded ... and only the important parts - should be fast.
    # Shape: (94523, 4)
    # will have to find a way to improve this...
    # The CSV file has more than one row per WorkCode, because there are different versions (as different records).
    # Only want the last one.
    # Each record must now be "enriched" with the matching record from the CSV file.
    finalReport = []  # we are expecting this to be [9893 rows x 12 columns] at the end
    counter = -1
    for eachWorksite in LBW[:5]:  # let's just work with 5 records right now...
        counter += 1
        # eachWorksite = list(eachWorksite)  # eachWorksite is a tuple - so need to convert it
        # # but if we change it to a list, we lose the headers!
        certID = LBW[counter][0]  # get the ID to use for lookup matching
        # Search the CSV data
        permitsFound = allPermits[allPermits['Id'] == certID]
        permitsFound = permitsFound.to_numpy()
        if numpy.shape(permitsFound)[0] > 1:
            print("Too many hits!")  # got to deal with that CSV Version field.
            exit()
        else:
            # now "enrich" the record/row by adding on the fields from the lookup
            # so a row goes from 8 fields to 12 fields
            newline = numpy.append(eachWorksite, permitsFound)
            # and this enhanced record/row must become the new normal
            # but I cannot change the original, so it must go into a new container
            finalReport = numpy.append(finalReport, newline, axis=0)
    # now I should have a new container of "enriched" data
    # which has gone from [9893 rows x 8 columns] to [9893 rows x 12 columns]
    # Some of the columns, of course, could be empty.
    # Now let's dump the results to an Excel file and make it accessible for everyone else.
    df = pandas.DataFrame(finalReport)
    filepath = 'finalreport.csv'
    df.to_csv(filepath, index=False)
    # Somewhere I was getting Error("Cannot convert {0!r} to Excel".format(value))
    # Now I get
    filepath = 'finalReport.xlsx'
    df.to_excel(filepath, index=False)
I have eventually answered my own question, and this is how:
Yes, for my situation, pandas worked just fine, even beautifully for
manipulating non numerical data. I just had to learn some basics.
The biggest learning was to understand the pandas dataframe as an object that has to be manipulated remotely by various functions/tools. Just because I "print" the dataframe doesn't mean it's just text (thanks to juanpa.arrivillaga for pointing out my erroneous assumptions in Why can I not reproduce a nd array manually?).
I also had to wrap my mind around the concept of indexes and columns, how they can be altered/manipulated, and then how to use them to maximum effect.
Once those fundamentals had been sorted, the rest followed naturally, and my code reduced to a couple of nice elegant functions.
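For anyone landing on this later, here is a minimal sketch of the kind of pandas-only enrichment described above (my own reconstruction under the column names shown in the samples, not the author's actual final code):

import pandas as pd

# the lookup CSV holds several versions per Id - keep only the latest
permits = pd.read_csv('lookup.csv')
permits = permits.sort_values('Version').drop_duplicates('Id', keep='last')

# the source data as a dataframe (read from CSV here purely for illustration)
worksites = pd.read_csv('worksites.csv')

# enrich: a left join keeps every worksite and fills the lookup columns
# with NaN where there is no matching permit
report = worksites.merge(permits, left_on='WorkCode', right_on='Id', how='left')
report.to_excel('finalReport.xlsx', index=False)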
Cheers

Python: Average values in a CSV file based on value of another column

I am a noob and I have a large CSV file with data structured like this (with a lot more columns):
State daydiff
CT 5.5
CT 6.5
CT 6.25
NY 3.2
NY 3.225
PA 7.522
PA 4.25
I want to output a new CSV where the daydiff is averaged for each State like this:
State daydiff
CT 6.083
NY 3.2125
PA 5.886
I have tried numerous ways, and the cleanest seemed to be leveraging pandas groupby, but when I run the code below:
import pandas as pd
df = pd.read_csv('C:...input.csv')
df.groupby('State')['daydiff'].mean()
df.to_csv('C:...AverageOutput.csv')
I get a file that is identical to the original file but with a counter added in the first column with no header:
,State,daydiff
0,CT,5.5
1,CT,6.5
2,CT,6.25
3,NY,3.2
4,NY,3.225
5,PA,7.522
6,PA,4.25
I was also hoping to control the new average in daydiff to a decimal going out only to the hundredths. Thanks
The "problem" with the counter is because the default behaviour for to_csvis to write the index. You should do df.to_csv('C:...AverageOutput.csv', index=False).
You can control the output format of daydiff by converting it to string. df.daydiff = df.daydiff.apply(lambda x: '{:.2f}'.format(x))
Your complete code should be:
import pandas as pd

df = pd.read_csv('C:...input.csv')
df2 = df.groupby('State')['daydiff'].mean().apply(lambda x: '{:.2f}'.format(x))
df2.to_csv('C:...AverageOutput.csv')
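Alternatively, if you would rather keep daydiff numeric instead of converting it to a string, round() on the grouped result achieves the same precision (a variant of the answer above, not part of the original):

import pandas as pd

df = pd.read_csv('C:...input.csv')
# mean per state, rounded to hundredths, still a float column
df2 = df.groupby('State')['daydiff'].mean().round(2)
df2.to_csv('C:...AverageOutput.csv')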

Pandas import csv: pandas.io.common.EmptyDataError: No columns to parse from file

My data in my file looks like this:
link availability product_type
1 1016842-5 "GlamWhite Home Bleaching Refill Kit (6% Wasserstoffperoxid)"
1 1045231-4 "Cabernet Sauvignon Burgenland Weingut Erich Scheiblhofer 2011 - 75cl"
1 1045232-4 "Blaufränkisch Ried Oberer Wald Burgenland Ernst Triebaumer 2009 - 75cl"
And I am trying to read the csv with pandas like this:
import csv
import pandas as pd
file_path = '/Users/nasiantalla/Downloads/pdsfeed (2).csv'
data = pd.read_csv(file_path, error_bad_lines=False, skiprows=1066576, sep='\t',
                   lineterminator='\r', encoding='utf-8', header=0,
                   usecols=['availability', 'link'])
However I still get error:
pandas.io.common.EmptyDataError: No columns to parse from file
I don't understand, I tried all the different encodings, but no luck.. Do you see something I haven't noticed?
Thanks!
Try without skiprows=1066576, sep='\t', lineterminator='\r'. In particular, skiprows=1066576 tells pandas to skip that many rows before it starts parsing; if the file contains fewer rows than that, nothing is left to read, which is exactly what EmptyDataError: No columns to parse from file means.
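A minimal sketch of the simplified call (keeping only the arguments that match the sample shown; the tab separator is an assumption carried over from the original attempt):

import pandas as pd

file_path = '/Users/nasiantalla/Downloads/pdsfeed (2).csv'
# let pandas find the header itself, and only skip rows when you know
# the file actually contains more rows than you are skipping
data = pd.read_csv(file_path, sep='\t', encoding='utf-8', usecols=['availability', 'link'])
print(data.head())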

CSV-like data in script to Pandas DataFrame

I've got a list of cities with associated lon,lat values that I'd like to turn into a DataFrame, but instead of reading from a CSV file, I want to have the user modify or add to these city,lat,lon values into a cell in an IPython notebook. Right now I have this solution that works, but it seems a bit ugly:
import numpy as np
import pandas as pd

sta = np.array([
    ('Boston', 42.368186, -71.047984),
    ('Provincetown', 42.042745, -70.171180),
    ('Sandwich', 41.767990, -70.466219),
    ('Gloucester', 42.610253, -70.660570)
    ],
    dtype=[('City', '|S20'), ('Lat', '<f4'), ('Lon', '<f4')])
# Create a Pandas DataFrame
obs = pd.DataFrame.from_records(sta,index='City')
print(obs)
Lat Lon
City
Boston 42.368187 -71.047981
Provincetown 42.042744 -70.171181
Sandwich 41.767990 -70.466217
Gloucester 42.610252 -70.660568
Is there a clearer, safer way to create the DataFrame?
I'm thinking that folks will forget the parentheses, add a closing ',' on the last line, etc.
Thanks,
Rich
You could just create a big multiline string that they edit, then use read_csv to read it from a StringIO object:
x = """
City, Lat, Long
Boston, 42.4, -71.05
Provincetown, 42.04, -70.12
"""
>>> pandas.read_csv(StringIO.StringIO(x.strip()), sep=r",\s*")
City Lat Long
0 Boston 42.40 -71.05
1 Provincetown 42.04 -70.12
Of course, people can still make errors with this (e.g., inserting commas), but the format is simpler.
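On Python 3 the same pattern works with io.StringIO; a sketch (regex separators need the python parser engine, so it is passed explicitly):

import io
import pandas as pd

x = """
City, Lat, Long
Boston, 42.4, -71.05
Provincetown, 42.04, -70.12
"""
df = pd.read_csv(io.StringIO(x.strip()), sep=r',\s*', engine='python')
print(df)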
