CSV-like data in script to Pandas DataFrame - python

I've got a list of cities with associated lon,lat values that I'd like to turn into a DataFrame, but instead of reading from a CSV file, I want the user to be able to modify or add to these city,lat,lon values in a cell of an IPython notebook. Right now I have this solution that works, but it seems a bit ugly:
import pandas as pd
import numpy as np

# Structured array of station names and coordinates
sta = np.array([
    ('Boston', 42.368186, -71.047984),
    ('Provincetown', 42.042745, -70.171180),
    ('Sandwich', 41.767990, -70.466219),
    ('Gloucester', 42.610253, -70.660570)
    ],
    dtype=[('City', '|S20'), ('Lat', '<f4'), ('Lon', '<f4')])
# Create a Pandas DataFrame
obs = pd.DataFrame.from_records(sta, index='City')
print(obs)
                    Lat        Lon
City
Boston        42.368187 -71.047981
Provincetown  42.042744 -70.171181
Sandwich      41.767990 -70.466217
Gloucester    42.610252 -70.660568
Is there a clearer, safer way to create the DataFrame?
I'm thinking that folks will forget the parentheses, leave a stray ',' on the last line, etc.
Thanks,
Rich

You could just create a big multiline string that they edit, then use read_csv to read it from a StringIO object:
x = """
City, Lat, Long
Boston, 42.4, -71.05
Provincetown, 42.04, -70.12
"""
>>> from io import StringIO  # on Python 2: from StringIO import StringIO
>>> pandas.read_csv(StringIO(x.strip()), sep=r",\s*", engine="python")
           City    Lat   Long
0        Boston  42.40 -71.05
1  Provincetown  42.04 -70.12
Of course, people can still make errors with this (e.g., inserting commas), but the format is simpler.
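If you'd rather avoid the regex separator, read_csv can also strip the padding after the commas for you with skipinitialspace (a variant I'm adding here, not part of the original answer):

import pandas
from io import StringIO  # Python 3; on Python 2: from StringIO import StringIO

# Same multiline string x as above; skipinitialspace drops the spaces after each comma
pandas.read_csv(StringIO(x.strip()), sep=",", skipinitialspace=True)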

Related

Header Information does not match array. deletechars argument

I am trying to create a numpy array from a Geosoft-formatted XYZ text file. The format uses a '/' to start the header line, with space-delimited column names after it. I believe numpy sees the '/' and assigns it as column zero.
Header info looks like this
/ Line Aircraft Flight Date JulL Time DateU TimeU Zn Easting Northing Lat Long xTrack ZFid_ms KFid AFid TFRAWT TFUNCT Mag4D VecX VecY VecZ VecTF MagRatio GPSHt Undul Sats HDop DGPS RadAlt BaroHPa Temp Humid CurrAmps Dn Up Samp Live RawTC RawK RawU RawTh Cosm LRange LStr LErr
import numpy as np
data = np.genfromtxt(filename, deletechars="/", usecols=(0,18,20,21,22,23),
                     invalid_raise=False, names=True, dtype=None)
data.dtype.names gives the result below; the first column did not get a valid column name:
('f0', 'TFRAWT', 'Mag4D', 'VecX', 'VecY', 'VecZ')
I thought deletechars was built for this purpose? Am I not using it correctly?
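One workaround (my sketch, not from the original post): deletechars only strips characters from each parsed name, and since the leading '/' is a whitespace-separated token of its own, that name becomes empty after deletion and falls back to 'f0'. You can instead read the header line yourself, strip the '/', and pass the names explicitly. Note the assumption that the data rows carry no '/' column, so the column indices count from 'Line' = 0 and shift down by one relative to the question's usecols:

import numpy as np

# Read the header line ourselves and drop the leading '/'
with open(filename) as f:
    header = f.readline().lstrip('/').split()

# Line, TFRAWT, Mag4D, VecX, VecY, VecZ under the shifted indexing
usecols = (0, 17, 19, 20, 21, 22)
data = np.genfromtxt(filename, skip_header=1, usecols=usecols,
                     names=[header[i] for i in usecols],
                     invalid_raise=False, dtype=None)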

Are pandas and numpy any good for manipulation of non-numeric data?

I've been going in circles for days now, and I've run out of steam. It doesn't help that I'm new to python / numpy / pandas etc.
I started with numpy, which led me to pandas, because of a GIS function that delivers a numpy array of data. That is my starting point. I'm trying to get to an endpoint: a small enriched dataset in an Excel spreadsheet.
But it seems like going down a rabbit hole, trying to extract that data and then manipulate it with the numpy toolset. The delivered data is one-dimensional, but each row contains 8 fields. A simple conversion to pandas and then to ndarray magically makes it all good. Except that I lose the headers in the process, and it just snowballs from there.
I've had to re-evaluate my understanding based on some feedback on another post, and that's fine. But I'm just going in circles. Example after example seems to use predominantly numerical data, and I'm starting to get the feeling that's where its strength lies; trying to use it for what I'd call a non-mathematical / non-numerical purpose feels like barking up the wrong tree.
Any advice?
Addendum
The data I extract from the GIS system is names, dates, other textual data. I then have another csv file that I need to use as a lookup, so that I can enrich the source with more textual information which finally gets published to excel.
SAMPLE DATA - SOURCE
    WorkCode Status                      WorkName  StartDate     EndDate        siteType       Supplier
0  AT-W34319   None               Second building 2020-05-04  2020-05-31          Type A         Acem 1
1  AT-W67713   None  Left of the red office tower 2019-02-11  2020-08-28          Type B      Quester Q
2  AT-W68713   None                12 main street 2019-05-23  2020-11-03  Class 1 Type B  Dettlim Group
3  AT-W70105   None                  city central 2019-03-07  2021-08-06           Other       Hans Int
4  AT-W73855   None                     top floor 2019-05-06  2020-10-28          Type a           None
SAMPLE DATA - CSV
["Id", "Version","Utility/Principal","Principal Contractor Contact"]
XM-N33463,7.1,"A Contracting company", "555-12345"
XM-N33211,2.1,"Contractor #b", "555-12345"
XM-N33225,1.3,"That other contractor", "555-12345"
XM-N58755,1.0,"v Contracting", "555-12345"
XM-N58755,2.3,"dsContracting", "555-12345"
XM-222222,2.3,"dsContracting", "555-12345"
BM-O33343,2.1,"dsContracting", "555-12345"
import arcpy
import numpy
import pandas

def SMAN():
    ####################################################################################################################
    # Exporting the results of the analysis...
    ####################################################################################################################
    """
    Approach is as follows:
    1) Get the source data
    2) Get the CSV lookup data loaded into memory - it'll be faster
    3) Iterate through the source data, looking for matches in the CSV data
    4) Add an extra couple of columns onto the source data, and populate them with the (matching) lookup data.
    5) Export the now enhanced data to Excel.
    """
    arcpy.env.workspace = workspace + filenameGDB  # workspace and filenameGDB are defined elsewhere
    input = "ApprovedActivityByLocalBoard"
    exportFile = arcpy.da.FeatureClassToNumPyArray(input, ['WorkCode', 'Status', 'WorkName', 'PSN2', 'StartDate', 'EndDate', 'siteType', 'Supplier'])
    # we have our data, but it's (9893,) instead of [9893 rows x 8 columns]
    pdExportFile = pandas.DataFrame(exportFile)
    LBW = pdExportFile.to_numpy()
    del exportFile
    del pdExportFile
    # Now we have [9893 rows x 8 columns] - but we've lost the headers
    col_list = ["WorkCode", "Version", "Principal", "Contact"]
    allPermits = pandas.read_csv("lookup.csv", usecols=col_list)
    # Now we have the CSV file loaded ... and only the important parts - should be fast.
    # Shape: (94523, 4)
    # will have to find a way to improve this...
    # CSV file has got more than WorkCode, because there are different versions (as different records)
    # Only want the last one.
    # each record must now be "enhanced" with the matching record from the CSV file.
    finalReport = []  # we are expecting this to be [9893 rows x 12 columns] at the end
    counter = -1
    for eachWorksite in LBW[:5]:  # let's just work with 5 records right now...
        counter += 1
        # eachWorksite = list(eachWorksite)  # eachWorksite is a tuple - so need to convert it
        # # but if we change it to a list, we lose the headers!
        certID = LBW[counter][0]  # get the ID to use for lookup matching
        # Search the CSV data
        permitsFound = allPermits[allPermits['Id'] == certID]
        permitsFound = permitsFound.to_numpy()
        if numpy.shape(permitsFound)[0] > 1:
            print("Too many hits!")  # got to deal with that CSV Version field.
            exit()
        else:
            # now "enrich" the record/row by adding on the fields from the lookup
            # so a row goes from 8 fields to 12 fields
            newline = numpy.append(eachWorksite, permitsFound)
            # and this enhanced record/row must become the new normal
            # but I cannot change the original, so it must go into a new container
            finalReport = numpy.append(finalReport, newline, axis=0)
    # now I should have a new container, of "enriched" data
    # which has gone from [9893 rows x 8 columns] to [9893 rows x 12 columns]
    # Some of the columns, of course, could be empty.
    # Now let's dump the results to an Excel file and make it accessible for everyone else.
    df = pandas.DataFrame(finalReport)
    filepath = 'finalreport.csv'
    df.to_csv(filepath, index=False)  # note: pass the variable, not the string 'filepath'
    # Somewhere I was getting Error("Cannot convert {0!r} to Excel".format(value))
    # Now I get
    filepath = 'finalReport.xlsx'
    df.to_excel(filepath, index=False)
I eventually answered my own question, and this is how:
Yes, for my situation, pandas worked just fine, even beautifully, for
manipulating non-numerical data. I just had to learn some basics.
The biggest learning was to understand the pandas DataFrame as an object that has to be manipulated remotely by various functions/tools. Just because I "print" the dataframe doesn't mean it's just text. (Thanks to juanpa.arrivillaga for pointing out my erroneous assumptions in Why can I not reproduce a nd array manually?)
I also had to wrap my mind around the concept of indexes and columns, how they can be altered/manipulated, and then how to use them to maximum effect.
Once those fundamentals had been sorted, the rest followed naturally, and my code reduced to a couple of nice elegant functions.
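For anyone curious, here is a minimal sketch of the kind of pandas-only approach this reduces to (illustrative only: the column names follow the samples above, and the drop_duplicates step is my assumption about how the Version records get handled):

import arcpy
import pandas

# Source data from GIS, as in the question; headers survive in the DataFrame,
# so there is no need to round-trip through to_numpy()
exportFile = arcpy.da.FeatureClassToNumPyArray("ApprovedActivityByLocalBoard",
    ['WorkCode', 'Status', 'WorkName', 'PSN2', 'StartDate', 'EndDate', 'siteType', 'Supplier'])
source = pandas.DataFrame(exportFile)

# Lookup CSV: keep only the latest Version per Id
lookup = pandas.read_csv("lookup.csv")
lookup = lookup.sort_values("Version").drop_duplicates("Id", keep="last")

# Enrich each worksite with its matching permit record, then publish
report = source.merge(lookup, left_on="WorkCode", right_on="Id", how="left")
report.to_excel("finalReport.xlsx", index=False)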
Cheers

Cannot convert object type to string; and then filter on that string; python pandas dataframe

I am trying to pull all stock tickers from NYSE, and then filter out for only those with MarketCap above 5B.
I am running into a problem because, based on how my data load comes in, all columns are data type "Object", and I cannot find any way to convert them to anything else. See my code and comments below:
import pandas as pd
import numpy as np
# NYSE
url_nyse = "http://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nyse&render=download"
df = pd.read_csv(url_nyse, index_col=0)  # pd.DataFrame.from_csv is deprecated; read_csv is equivalent
df = df.drop(df.columns[[0, 1, 3, 6, 7]], axis=1)
This is my initial data load of NYSE stocks, and then I filter for just MarketCap, Sector, and Industry.
At first I was hoping to filter MarketCap by removing anything with "M" in it, then stripping the first and last characters to get a number, which could then be filtered to keep anything above 5. However, I think because the data type is "Object" and not string, I have not been able to do it directly. So I then created new columns that would contain only letters or numbers, hoping that I could convert those to data type string and float from there.
df['MarketCap_Num'] = df.MarketCap.str[1:-1]
df['Billion_Filter'] = df.MarketCap.str[-1:]
So MarketCap_Num column has only the numbers by removing the first and last characters and Billion_Filter is only the last character where I will remove any value that = M.
However, even though these columns are just numbers or just strings, I CANNOT find any way to convert them from the object datatype, so my filtering is not working at all. Any help is much appreciated.
I have tried .astype(float), pd.to_numeric, type functions to no success.
My filtering code would then be:
df[df.Billion_Filter.str.contains("B")]
But when I run that, nothing happens: no error, but no filtering either. When I run this code on a different table it works, so it must be the object data type that is holding it up.
Convert the MarketCap column into floats by first removing the dollar signs and then substituting B with e9 and M with e6. This should make it easy to use .astype(float) on the column to do the conversion.
import pandas as pd
import numpy as np
# NYSE
url_nyse = "http://www.nasdaq.com/screening/companies-by-name.aspx?letter=0&exchange=nyse&render=download"
df = pd.read_csv(url_nyse, index_col=0)  # pd.DataFrame.from_csv is deprecated; read_csv is equivalent
df = df.drop(df.columns[[0, 1, 3, 6, 7]], axis=1)
df = df.replace({'MarketCap': {r'\$': '', 'B': 'e9', 'M': 'e6', 'n/a': np.nan}}, regex=True)
df.MarketCap = df.MarketCap.astype(float)
print(df[df.MarketCap > 5000000000].head(10))
Yields:
           MarketCap             Sector                                          industry
Symbol
MMM     1.419900e+11        Health Care                        Medical/Dental Instruments
WUBA    1.039000e+10         Technology  Computer Software: Programming, Data Processing
ABB     5.676000e+10  Consumer Durables                               Electrical Products
ABT     9.887000e+10        Health Care                             Major Pharmaceuticals
ABBV    1.563200e+11        Health Care                             Major Pharmaceuticals
ACN     9.388000e+10      Miscellaneous                                 Business Services
AYI     7.240000e+09  Consumer Durables                                 Building Products
ADNT    7.490000e+09      Capital Goods                                 Auto Parts:O.E.M.
AAP     7.370000e+09  Consumer Services                            Other Specialty Stores
ASX     1.083000e+10         Technology                                    Semiconductors
You should be able to change the type of the MarketCap_Num column by using:
df['MarketCap_Num'] = df.MarketCap.str[1:-1].astype(np.float64)
You can then check the data types by df.dtypes.
As for the filter, you can simply say
df_filtered = df[df['Billion_Filter'] =="B"].copy()
since you will only have one letter in your Billion_Filter column.
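Putting the two columns together (my illustration, not part of the original answer), the $5B screen would look like:

# assumes MarketCap values like '$7.24B' (no n/a rows)
df['MarketCap_Num'] = df.MarketCap.str[1:-1].astype(np.float64)
df['Billion_Filter'] = df.MarketCap.str[-1:]
big_caps = df[(df['Billion_Filter'] == 'B') & (df['MarketCap_Num'] > 5)]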
The object datatype works like string here. You should be able to use both str.contains and extract the number without having to convert the object type to string:
df = df[df['MarketCap'].str.contains('B')].copy()
df['MarketCap'] = df['MarketCap'].str.extract(r'(\d+\.?\d*)', expand=False)
        MarketCap            Sector                                          industry
Symbol
DDD          1.12        Technology           Computer Software: Prepackaged Software
MMM        141.99       Health Care                        Medical/Dental Instruments
WUBA        10.39        Technology  Computer Software: Programming, Data Processing
EGHT         1.32  Public Utilities                      Telecommunications Equipment
AIR          1.48     Capital Goods                                         Aerospace
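One caveat worth adding (mine, not from the answer above): str.extract still returns strings, so a numeric filter needs a conversion afterwards; since the rows were already filtered on 'B', the values are in billions:

df['MarketCap'] = df['MarketCap'].astype(float)
print(df[df['MarketCap'] > 5])  # above $5B, since units are billions after the 'B' filter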

Python: Average values in a CSV file based on value of another column

I am a noob and I have a large CSV file with data structured like this (with a lot more columns):
State daydiff
CT 5.5
CT 6.5
CT 6.25
NY 3.2
NY 3.225
PA 7.522
PA 4.25
I want to output a new CSV where the daydiff is averaged for each State like this:
State daydiff
CT 6.083
NY 3.2125
PA 5.886
I have tried numerous ways, and the cleanest seemed to be leveraging pandas groupby, but when I run the code below:
import pandas as pd
df = pd.read_csv('C:...input.csv')
df.groupby('State')['daydiff'].mean()
df.to_csv('C:...AverageOutput.csv')
I get a file that is identical to the original file but with a counter added in the first column with no header:
,State,daydiff
0,CT,5.5
1,CT,6.5
2,CT,6.25
3,NY,3.2
4,NY,3.225
5,PA,7.522
6,PA,4.25
I was also hoping to limit the new daydiff average to two decimal places. Thanks
The "problem" with the counter is because the default behaviour for to_csvis to write the index. You should do df.to_csv('C:...AverageOutput.csv', index=False).
You can control the output format of daydiff by converting it to a string: df.daydiff = df.daydiff.apply(lambda x: '{:.2f}'.format(x))
Your complete code should be:
df = pd.read_csv('C:...input.csv')
df2 = df.groupby('State')['daydiff'].mean().apply(lambda x: '{:.2f}'.format(x))
df2.to_csv('C:...AverageOutput.csv')
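If you would rather keep daydiff numeric instead of formatting it as a string, rounding the mean is an alternative (my variant, not part of the answer above):

df2 = df.groupby('State')['daydiff'].mean().round(2)
df2.to_csv('C:...AverageOutput.csv', header=True)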

Unable to merge two dataframes using pandas

Despite multiple attempts, I am not succeeding at a simple merge operation on two dataframes. The code below returns
KeyError: 'CODE'
on the merge function.
Note 1: To make the post reproducible, StringIO is used here with only two lines within each CSV, but in real life I read from files with thousands of records.
Note 2: Notice the trailing ',' (separator) at the end of each line: my CSV files are badly formatted but this is how actual files are.
Note 3: I am using Python 2.7
from StringIO import StringIO
import pandas as pd
master = StringIO("""N-NUMBER,SERIAL NUMBER,MFR MDL CODE,ENG MFR MDL,YEAR MFR,TYPE REGISTRANT,NAME,STREET,STREET2,CITY,STATE,ZIP CODE,REGION,COUNTY,COUNTRY,LAST ACTION DATE,CERT ISSUE DATE,CERTIFICATION,TYPE AIRCRAFT,TYPE ENGINE,STATUS CODE,MODE S CODE,FRACT OWNER,AIR WORTH DATE,OTHER NAMES(1),OTHER NAMES(2),OTHER NAMES(3),OTHER NAMES(4),OTHER NAMES(5),EXPIRATION DATE,UNIQUE ID,KIT MFR, KIT MODEL,MODE S CODE HEX,
1 ,1071 ,3980115,54556,1988,5,FEDERAL AVIATION ADMINISTRATION ,WASHINGTON REAGAN NATIONAL ARPT ,3201 THOMAS AVE HANGAR 6 ,WASHINGTON ,DC,20001 ,1,001,US,20160614,19900214,1T ,5,5 ,V ,50000001, ,19880909, , , , , ,20191130,00524101, , ,A00001 ,""")
mfr = StringIO("""CODE,MFR,MODEL,TYPE-ACFT,TYPE-ENG,AC-CAT,BUILD-CERT-IND,NO-ENG,NO-SEATS,AC-WEIGHT,SPEED,
3980115,EXLINE ACE-C ,ACE-C ,4,1 ,1,1,01,001,CLASS 1,0082,""")
masterdf = pd.read_csv(master,sep=",",index_col=False)
mfrdf = pd.read_csv(mfr,sep=",",index_col=False)
masterdf.merge(mfrdataframe,left_on='MFR MDL CODE',right_on='CODE', how='inner')
I think that the problem is the name of the dataframe you're passing to merge: mfrdataframe should instead be mfrdf.
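With that fix, the merge line from the code above becomes:

masterdf.merge(mfrdf, left_on='MFR MDL CODE', right_on='CODE', how='inner')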
