Automatically classify elements in an array - python

I am building an inventory of vegetation types for a given location. Data is passed to me as a CSV, and I want a way to automatically re-classify the items in one column into broader classes that I provide. I can already read the data in with pandas, do a bit of housekeeping, and then write the data frame out to a new file.
However, given I am provided with a column such that:
species = ['maple', 'oak', 'holly', 'sawgrass', 'cat tails', ..., 'birch']
I would like to be able to automatically reclassify these into broad categories using another list like:
VegClass = ['Tree', 'Bush', 'Grass']
The only way I know to do this would be to iterate through the species list, in a manner similar to:
out = []
for i in species:
    if i == 'oak':
        out.append('Tree')
but this would require that I write a lot of code if the species list becomes very large and I don't imagine it would be very efficient with large datasets.
Is there a more direct way of doing this? I understand that I would need to list all the species manually (in separate classes), e.g.:
TreeSpecies = ['oak', ..., 'birch']
GrassSpecies = ['sawgrass', ...]
but I would only have to do this once to build a dictionary of species names. I'm expecting more data, so I may have to add an additional species name or two in the future, but this would not be considered too time intensive if I could process a lot of the data quickly.
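Something along these lines is what I have in mind for folding those per-class lists into a single lookup (the lists below are only illustrative, not the full species set):

TreeSpecies = ['oak', 'maple', 'holly', 'birch']
GrassSpecies = ['sawgrass']
BushSpecies = ['cat tails']

classifier = {}
for veg_class, names in [('Tree', TreeSpecies),
                         ('Grass', GrassSpecies),
                         ('Bush', BushSpecies)]:
    for name in names:
        classifier[name] = veg_class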

You need to create a dict of classifier mappings for your different items, for instance,
classifier = {'oak': 'Tree',
              'maple': 'Tree',
              'holly': 'Tree',
              'sawgrass': 'Grass',
              'cat tails': 'Bush',
              'birch': 'Tree'}
Then getting a column of groups is as simple as calling map on your column.
>>> df.species.map(classifier)
0     Tree
1     Tree
2     Tree
3    Grass
4     Bush
5     Tree
Name: species, dtype: object
so you can set a new column with
df['classification'] = df.species.map(classifier)
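One extra note (not part of the original answer): any species missing from the classifier dict will map to NaN, so if you prefer a default label you can chain fillna - the 'Unclassified' label here is just an example:

df['classification'] = df.species.map(classifier).fillna('Unclassified')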

You need a dictionary like
VegClass = {'oak': 'Tree', 'sawgrass': 'Grass'}
df['class'] = df['species'].map(VegClass)

I don't know if I follow you, but since you will have to create some sort of associative list, in the form
plant | type
oak | tree
sawgrass | grass
kkk | bush
...
Just create a hash table and get the type from the hash table.
You may read the table from an external file so it is not hardcoded in your program.
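A minimal sketch of that idea, assuming the mapping lives in a two-column CSV (the file name and column names here are made up):

import pandas as pd

# species_classes.csv (hypothetical) has two columns: species, veg_class
lookup = pd.read_csv('species_classes.csv', index_col='species')['veg_class'].to_dict()
df['classification'] = df['species'].map(lookup)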

Related

Using dictionary and dataframe to create new arrays with variable names with loop

I am currently working to process some data that are imported to Python as a dataframe that has 10000 rows and 20 columns. The columns store sample names and chemical elements. The dataframe is currently indexed by both sample name and time, appearing as shown here: https://i.stack.imgur.com/7knqD.png
From this dataframe, I want to create individual arrays for each individual sample, of which there are around 25, with a loop. I have generated an index and array of the sample names, which yields an array that appears as follows:
samplename = fuegodataframe.index.levels[0]
samplearray = samplename.to_numpy()
array(['AC4-EUH41', 'AC4-EUH79N', 'AC4-EUH79S', 'AC4-EUH80', 'AC4-EUH81',
'AC4-EUH81b', 'AC4-EUH82N', 'AC4-EUH82W', 'AC4-EUH84',
'AC4-EUH85N', 'AC4_EUH48', 'AC4_EUH48b', 'AC4_EUH54N',
'AC4_EUH54S', 'AC4_EUH60', 'AC4_EUH72', 'AC4_EUH73', 'AC4_EUH73W',
'AC4_EUH78', 'AC4_EUH79E', 'AC4_EUH79W', 'AC4_EUH88', 'AC4_EUH89',
'bhvo-1', 'bhvo-2', 'bir-1', 'bir-2', 'gor132-1', 'gor132-2',
'gor132-3', 'sc ol-1', 'sc ol-2'], dtype=object)
I have also created a dictionary with keys of each of these variable names. I am now wondering how I would use this dictionary to generate individual variables for each of these samples that capture all the rows in which a sample is found.
I have tried something along these lines:
for ii in sampledictionary.keys():
    if ii == sampledictionary[ii]:
        sampledictionary[ii] = fuegodataframe.loc[sampledictionary[ii]]
but this fails. How would I actually go about doing something like this? Is this possible?
I think you're asking how to generate variables dynamically rather than assign your output to a key in your dictionary.
In Python there is a built-in globals() function that returns a dictionary of all the names defined at module level.
You can assign new variables dynamically through this dictionary:
globals()[f'variablename_{ii}'] = fuegodataframe.loc[sampledictionary[ii]]
etc.
if ii was 0 then variablename_0 would be available with the assigned value.
In general this is not considered good practice but it is required sometimes.
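If the dynamic variable names aren't strictly required, a minimal sketch of the usual alternative (reusing the names from the question) is to keep each sample's rows in a dictionary keyed by sample name:

# one DataFrame per sample, keyed by sample name
sample_frames = {name: fuegodataframe.loc[name]
                 for name in fuegodataframe.index.levels[0]}

# access a single sample's rows
ac4_euh41 = sample_frames['AC4-EUH41']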

Creating a series of variables from CSVs in Python?

I am trying to create a series of dictionaries from CSVs that I want to import but I am not sure the best way to do it.
I used RatingFactors = os.listdir(RatingDirectory) and
CSVLocations = []
for factor in RatingFactors:
    CSVLocations.append(RatingDirectory + factor)
to create a list of CSVs, these CSVs contain what is essentially a dictionary of FactorName | Factor Value, then 1 | 5, 2 | 3.5.
I want to create a dictionary for each CSV, ideally named based on the CSV's name. However, I understand that it is considered bad practice to try to name variables dynamically inside a loop.
I tried creating a generator expression using df_from_each_file = (pd.read_csv(CSVs) for CSVs in CSVLocations), and if I print the generator using for y in df_from_each_file: print(y), it gives me each of the dataframes, but I don't know how to separate them out.
What is the Pythonic way to do this?
How the CSVs look post import
0 0 1.1
1 1 0.9
2 2 0.9
3 3 0.9
etc
Edit:
Attempt to rephrase my question.
I have a series of CSVs which look like they are formatted like dictionaries, they have two columns and they represent how one factor relates to another. I would like to make a dictionary for each CSV, named like the CSV so that I can interact with them from Python.
Edit 2:
I believe this question is different than the one referenced as that is creating a single dataframe which contains all of the dictionaries, I want all of the dictionaries to be separate rather than in a single unit. I tried using their answer before asking this and I could not separate them out.
I think you need a dict comprehension with os.path.basename for the keys:
import glob, os
import pandas as pd

files = glob.glob('files/*.csv')
sers = {os.path.basename(f).split('.')[0]: pd.read_csv(f, index_col=[0]).squeeze()
        for f in files}
If you want one big Series:
d = pd.concat(sers, ignore_index=False)
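Each individual Series can then be pulled out by its file's base name and converted to a plain dict if needed (the 'factor1' key below is just a placeholder for whatever your files are called):

factor1 = sers['factor1']          # Series built from files/factor1.csv
factor1_dict = factor1.to_dict()   # plain dict of FactorName -> Factor Value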

Python Data Analysis from SQL Query

I'm about to start some Python Data analysis unlike anything I've done before. I'm currently studying numpy, but so far it doesn't give me insight on how to do this.
I'm using python 2.7.14 Anaconda with cx_Oracle to Query complex records.
Each record will be a unique individual with a column for Employee ID, Relationship Tuples (Relationship Type Code paired with Department number; may contain multiple), and Account Flags (Flag strings; may contain multiple) - 3 columns total.
so one record might be:
[(123456), (135:2345678, 212:4354670, 198:9876545), (Flag1, Flag2, Flag3)]
I need to develop a python script that will take these records and create various counts.
The example record would be counted in at least 9 different counts
How many with relationship: 135
How many with relationship: 212
How many with relationship: 198
How many in Department: 2345678
How many in Department: 4354670
How many in Department: 9876545
How many with Flag: Flag1
How many with Flag: Flag2
How many with Flag: Flag3
The other tricky part of this is that I can't pre-define the relationship codes, departments, or flags. What I'm counting has to be determined by the data retrieved from the query.
Once I understand how to do that, hopefully the next step to also get how many relationship X has Flag y, etc., will be intuitive.
I know this is a lot to ask about, but If someone could just point me in the right direction so I can research or try some tutorials that would be very helpful. Thank you!
First of all, you need to structure this data to make a good analysis; you can do it in your database engine or in Python (I will do it this way, using pandas as SNygard suggested).
First, I create some fake data (based on what you provided):
import pandas as pd
import numpy as np
from ast import literal_eval

data = [[12346, '(135:2345678, 212:4354670, 198:9876545)', '(Flag1, Flag2, Flag3)'],
        [12345, '(136:2343678, 212:4354670, 198:9876541, 199:9876535)', '(Flag1, Flag4)']]
df = pd.DataFrame(data, columns=['id', 'relationships', 'flags'])
df = df.set_index('id')
df
This returns a dataframe like this:
raw_pandas_dataframe
In order to summarize or count by columns, we need to improve our data structure so that we can apply group-by operations on department, relationship type, or flags.
We will convert our relationships and flags columns from strings to Python lists of strings. So, the flags column will be a Python list of flags, and the relationships column will be a Python list of relations.
df['relationships'] = df['relationships'].str.replace('(', '', regex=False).str.replace(')', '', regex=False)
df['relationships'] = df['relationships'].str.split(',')
df['flags'] = df['flags'].str.replace('(', '', regex=False).str.replace(')', '', regex=False)
df['flags'] = df['flags'].str.split(',')
df
The result is:
dataframe_1
With our relationships column converted to lists, we can create a new dataframe with as many columns as the longest relationship list has entries.
rel = pd.DataFrame(df['relationships'].values.tolist(), index=df.index)
After that we need to stack our columns while preserving the index, so we will use a pandas MultiIndex: the id and the relation column number (0, 1, 2, 3).
relations = rel.stack()
relations.index.names = ['id','relation_number']
relations
We get: dataframe_2
At this point we have all of our relations in rows, but we still can't group by relation_type. So we will split our relations data into two columns, relation_type and department, using the : separator.
# strip the spaces left over from splitting on ',' before splitting on ':'
clear_relations = relations.str.strip().str.split(':')
clear_relations = pd.DataFrame(clear_relations.values.tolist(),
                               index=clear_relations.index,
                               columns=['relation_type', 'department'])
clear_relations
The result is
dataframe_3_clear_relations
Our relations are ready to analyze, but our flags structure is still not very useful. So we will convert the flag lists to columns and then stack them as well.
flags = pd.DataFrame(df['flags'].values.tolist(), index=df.index)
flags = flags.stack()
flags.index.names = ['id', 'flag_number']
The result is dataframe_4_clear_flags
Voilà! It's all ready to analyze.
So, for example, to see how many relations of each type we have, and which one is the biggest:
clear_relations.groupby('relation_type').agg('count')['department'].sort_values(ascending=False)
We get: group_by_relation_type
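Counting the flags works the same way once they are stacked; a small sketch using the flags Series built above (the strip just removes the spaces left over from splitting on ','):

flags.str.strip().value_counts()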
All code: Github project
If you're willing to consider other packages, take a look at pandas, which is built on top of numpy. You can read SQL statements directly into a dataframe, then filter.
For example,
import pandas

sql = '''SELECT * FROM <table> WHERE <condition>'''
df = pandas.read_sql(sql, <connection>)

# Your output might look like the following:
#        0                                         1                      2
# 0  12346  (135:2345678, 212:4354670, 198:9876545)  (Flag1, Flag2, Flag3)
# 1  12345  (136:2343678, 212:4354670, 198:9876545)  (Flag1, Flag2, Flag4)

# Format your records into rows
# This part will take some work, and really depends on how your data is formatted
# Do you have repeated values? Are the records always the same size?

# Select only the rows where relationship = 125
rel_125 = df[df['Relationship'] == 125]
The pandas formatting is more in depth than fits in a Q&A, but some good resources are here: 10 Minutes to Pandas.
You can also filter the rows directly, though it may not be the most efficient. For example, the following query selects only the rows where a relationship starts with '212'.
df[df['Relationship'].apply(lambda x: any(y.startswith('212') for y in x))]
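If you'd rather count without reshaping the frame first, here is a rough, hedged sketch (not part of the original answers) that tallies every relationship code, department, and flag straight from the tuple-like strings, assuming the columns come back positionally as 0, 1, 2 as in the printed example - adjust the labels to your actual query:

from collections import Counter

rel_counts, dept_counts, flag_counts = Counter(), Counter(), Counter()
for rels, flags in zip(df[1], df[2]):
    for pair in rels.strip('()').split(','):
        code, dept = pair.strip().split(':')
        rel_counts[code] += 1
        dept_counts[dept] += 1
    for flag in flags.strip('()').split(','):
        flag_counts[flag.strip()] += 1

# e.g. rel_counts['212'] is how many individuals have relationship 212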

How to merge two datasets by specific column in pandas

I'm playing around with the Kaggle dataset "European Soccer Database" and want to combine it with another FIFA18-dataset.
My problem is that the name columns in these two datasets use different formats.
For example: "lionel messi" in one dataset and in the other it is "L. Messi"
I would like to convert "L. Messi" to the lowercase version "lionel messi" for all rows in the dataset.
What would be the most intelligent way to go about this?
One simple way is to convert the names in both dataframes into a common format so they can be matched.* Let's assume that in df1 names are in the L. Messi format and in df2 names are in the lionel messi format. What would a common format look like? You have several choices, but one option would be all lowercase, with just the first initial followed by a period: l. messi.
df1 = pd.DataFrame({'names': ['L. Messi'], 'x': [1]})
df2 = pd.DataFrame({'names': ['lionel messi'], 'y': [2]})

# lowercase the "L. Messi" style names
df1.names = df1.names.str.lower()
# reduce "lionel messi" to the same "l. messi" format
df2.names = df2.names.apply(lambda n: n[0] + '.' + n[n.find(' '):])

df = df1.merge(df2, left_on='names', right_on='names')
*Note: This approach is totally dependent on the names being "matchable" in this way. There are plenty of cases that could cause this simple approach to fail. If a team has two members, Abby Wambach and Aaron Wambach, they'll both look like a. wambach. If one dataframe tries to differentiate them by using other initials in their name, like m.a. wambach and a.k. wambach, the naive matching will fail. How you handle this depends on the size of your data - maybe you can try to match most players this way, see who gets dropped, and write custom code for them.
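One hedged way to "see who gets dropped" is an outer merge with an indicator column:

# rows where '_merge' is not 'both' are the names that failed to match
merged = df1.merge(df2, on='names', how='outer', indicator=True)
unmatched = merged[merged['_merge'] != 'both']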

Categories from a List in a DataFrame

There is a process in a Pandas DataFrame that I am trying to do for my capstone project on the Yelp Dataset Challenge. I have found a way to do it using loops, but given the large dataset I am working with, it takes a long time. (I tried running it for 24 hours, and it still was not complete.)
Is there a more efficient way to do this in Pandas without looping? 
Note: business.categories (business is a DataFrame) provides a list of categories a business is in, stored as a string (e.g. "[restaurant, entertainment, bar, nightlife]"). It is written in the format of a list but saved as a string.
# Creates a new DataFrame with businesses as rows and columns as category tags,
# with 0 or 1 depending on whether the business is in that category
categories_list = []

# Makes empty values a string representing an empty list. This prevents Null errors later in the code.
business.categories = business.categories.fillna('[]')

# Creates all categories as a single list. Goes through each business's list of
# categories and adds any unique values to the master list, categories_list
for x in range(len(business)):
    # business.categories stores each value as a list (even though it's formatted
    # just like a string), so this converts it to a list
    categories = eval(str(business.categories[x]))
    # Looks at each category, adding it to categories_list if it's not already there
    for category in categories:
        if category not in categories_list:
            categories_list.append(category)

# Makes the list of categories (and business_id) the columns of the new DataFrame
categories_df = pd.DataFrame(columns=['business_id'] + categories_list, index=business.index)

# Loops through, determining whether or not each business has each category,
# storing this as a 1 or 0 for that category respectively.
for x in range(len(business)):
    for y in range(len(categories_list)):
        cat = categories_list[y]
        if cat in eval(business.categories[x]):
            categories_df[cat][x] = 1
        else:
            categories_df[cat][x] = 0

# Imports the original business_id's into the new DataFrame. This allows me to
# cross-reference this DataFrame with my other datasets for analysis
categories_df.business_id = business.business_id
categories_df
Given that the data is stored as list-like strings, I don't think you can avoid looping over the data frame (either explicitly or implicitly, using str methods) at Python speed (this seems like an unfortunate way of storing the data; can it be avoided upstream?). However, I have some ideas for improving the approach. Since you know the resulting index ahead of time, you can immediately start building the DataFrame without knowing all the categories in advance, something like:
categories_df = pd.DataFrame(index=business.index)
for ix, categories in business.categories.items():
    for cat in eval(categories):
        # if cat is not already in the columns, this will add it in,
        # with null values in the other rows
        categories_df.loc[ix, cat] = 1
categories_df.fillna(0, inplace=True)
If you know some or all of the categories in advance then adding them as columns initially before the loop should help as well.
Also, you could try doing categories[1:-1].split(', ') instead of eval(categories). A quick test tells me it should be around 15 times faster.
To ensure the same result, you should do
for ix, categories in business.categories.items():
    for cat in categories[1:-1].split(','):
        categories_df.loc[ix, cat.strip()] = 1
to be on the safe side, as you won't know how much white space there might be around the commas. Avoiding much of the nested looping and in statements should speed your programme up considerably.
Not exactly sure what you ultimately want to do... but
Consider the dataframe business
business = pd.DataFrame(dict(
    categories=['[cat, dog]', '[bird, cat]', '[dog, bird]']
))
You can convert these strings to lists with
business.categories.str.strip('[]').str.split(', ')
Or even pd.get_dummies
business.categories.str.strip('[]').str.get_dummies(', ')
   bird  cat  dog
0     0    1    1
1     1    1    0
2     1    0    1
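If you also need the business_id alongside the dummy columns (assuming your real frame has a business_id column, as in your original code), one hedged way is to join them back together:

categories_df = business[['business_id']].join(
    business.categories.str.strip('[]').str.get_dummies(', '))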
