Python function that evaluates each row by a list

I am using Python to clean address data and standardize abbreviations so that it can be compared against other address data. I now have two Pandas dataframes. I would like to compare each row in the first dataframe, named df, against a list of addresses built from a dataframe of similar structure, second_df. If the address from df is on the list, I would like to create a column to note this, ideally with the string 'found' rather than just a boolean. I have tried isin and it did not work.
For example, suppose my data looks like the sample below. I would like to compare each row of df['concat'] to the entire list (named list) to see whether that address appears in second_df.
import pandas as pd

read = pd.read_excel('fullfilepath.xlsx')
second_df = pd.read_excel('anotherfilepath.xlsx')
df = read[['column1', 'column2', 'concat']]
list = second_df.concat.tolist()  # note: this shadows the built-in list type

EDIT based on tdy's comment: my original answer didn't supply a value for the False branch of the where statement.
Try something like this:
import numpy as np

df["isFound"] = np.where(df['concat'].isin(second_df["concat"]), "found", "notfound")
This should be exactly what you need.
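For illustration, a minimal runnable sketch with made-up toy addresses (the column names follow the question; the data is hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({'concat': ['12 MAIN ST', '99 OAK AVE', '7 PINE RD']})
second_df = pd.DataFrame({'concat': ['12 MAIN ST', '7 PINE RD']})

# isin returns a boolean Series; np.where maps True/False to strings
df['isFound'] = np.where(df['concat'].isin(second_df['concat']), 'found', 'notfound')
print(df)
#        concat   isFound
# 0  12 MAIN ST     found
# 1  99 OAK AVE  notfound
# 2   7 PINE RD     found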

Related

Using describe() method to exclude a column

I am new to using Python with data sets and am trying to exclude a column ("id") from being shown in the output. I am wondering how to go about this using describe() and its exclude parameter.
describe() works on dtypes: you can include or exclude based on the datatype, not based on column names. If your id column is the only one of its datatype, then
df.describe(exclude=[datatype])
or if you just want to remove the column(s) from describe, then try this
cols = set(df.columns) - {'id'}  # note: a set does not preserve column order
df1 = df[list(cols)]
df1.describe()
Ta-da, it's done. For more info, see the pandas documentation for describe.
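A minimal sketch of the dtype-based route on made-up data (the column names here are hypothetical):

import pandas as pd

df = pd.DataFrame({
    'id': [101, 102, 103],       # integer identifier we want to hide
    'height': [1.7, 1.8, 1.6],   # float measurement
    'weight': [65.0, 80.5, 58.2],
})

# works only because 'id' is the sole integer column here
print(df.describe(exclude=['int64']))

# dropping by name is more robust when dtypes overlap
print(df.drop(columns=['id']).describe())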
You can do that by slicing your original DF to remove the 'id' column. One way is through .iloc. Let's suppose the column 'id' is the first column of your DF; then you could do this:
df.iloc[:,1:].describe()
The first colon represents the rows, the second the columns.
Although somebody responded with an example from the official docs which is more than enough, I'd just like to add this, since it might help a few people:
If your DataFrame is large (say, hundreds of columns), removing one or two columns might not be enough; instead, create a smaller DataFrame holding only what you're interested in and go from there.
Example of removing 2+ columns:
columns_to_keep = set(your_bigger_data_frame.columns) - {'column_1', 'column_2', 'column_3'}
your_new_smaller_data_frame = your_bigger_data_frame[list(columns_to_keep)]
your_new_smaller_data_frame.describe()
If your DataFrame is small or medium-sized, you already know every column, and you only need a few of them, just create a new DataFrame and then apply describe().
I'll give an example of reading a .csv file and then taking a smaller portion of that DataFrame which holds only what you need:
df = pd.read_csv(r'.\docs\project\file.csv')  # raw string so the backslashes are literal
df = df[['column_1', 'column_2', 'column_3']]
df.describe()
Use output.drop(columns=['id']).describe() (note that describe's exclude parameter takes dtypes, not column names, so exclude=['id'] would not work).

Filtering Pandas Dataframe Based on List of Column Names

I have a pandas dataframe which has maybe 1000 columns. However, I do not need so many columns; I need a column only if its name matches, starts with, or contains specific strings.
So let's say I have dataframe columns like:
df.columns =
HYTY, ABNH, CDKL, GHY#UIKI, BYUJI##hy, BYUJI#tt, BBNNII#5, FGATAY#J, ...
I want to select only the columns whose names are like HYTY, CDKL, BYUJI*, and BBNNI*.
So what I was trying to do is to create a list of regular expressions like:
import re
relst = ['HYTY', 'CDKL*', 'BYUJI*', 'BBNI*']
my_w_lst = [re.escape(s) for s in relst]
mask_pattrn = '|'.join(my_w_lst)
Then I create a logical vector of TRUE/FALSE to say whether each string is present or not. However, I do not understand how to get a dataframe containing only the matching columns.
Any help will be appreciated.
Using what you already have, you can pass your mask to filter like:
df.filter(regex=mask_pattrn)
Use re.findall(). It will give you a list of columns to pass to df[mylist]
We can use startswith:
relst = ['CDKL', 'BYUJI', 'BBNI']
subdf = df.loc[:, df.columns.str.startswith(tuple(relst)) | df.columns.isin(['HYTY'])]
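A small sketch of both approaches on toy columns (the column names follow the question; the data is made up):

import pandas as pd

df = pd.DataFrame(0, index=[0],
                  columns=['HYTY', 'ABNH', 'CDKL', 'GHY#UIKI',
                           'BYUJI##hy', 'BYUJI#tt', 'BBNNII#5', 'FGATAY#J'])

# filter(regex=...) keeps columns whose names match the pattern
mask_pattern = '|'.join(['^HYTY$', '^CDKL', '^BYUJI', '^BBNNII'])
print(df.filter(regex=mask_pattern).columns.tolist())
# ['HYTY', 'CDKL', 'BYUJI##hy', 'BYUJI#tt', 'BBNNII#5']

# startswith with a tuple of prefixes gives the same selection here
prefixes = ('CDKL', 'BYUJI', 'BBNNII')
subdf = df.loc[:, df.columns.str.startswith(prefixes) | df.columns.isin(['HYTY'])]
print(subdf.columns.tolist())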

Removing rows from a dataframe, if the observation for a specific variable is numeric

I have a large data frame with over 20000 observations. I have a variable called "station" and I need to remove all rows that have only numbers as the station name.
The only code that has worked so far is:
df['station'][~df['station'].str.isnumeric()]
However, this only creates a data frame with one variable.
You can use an extra column with .str.isnumeric() to be used later as a filter:
df['filter'] = df['station'].str.isnumeric()
df_filtered = df[df['filter'] != False]  # .drop(columns=['filter'])
This should return all rows where the station column is not purely numeric. After that, you can uncomment the drop if you wish to remove the filter column and maintain your original structure.
You can do it like so,
df_filtered = df[df['station'].str.isnumeric()==False]
You wouldn't have to add an extra filter column to your dataframe if you use this; the inner statement is ultimately a boolean mask applied to the dataframe.
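A quick sketch on made-up data; note that .str.isnumeric() returns NaN for non-string entries, so fillna(False) is added as a guard:

import pandas as pd

df = pd.DataFrame({'station': ['Alpha', '12345', 'Bravo9', '777']})

# keep rows whose station name is not purely numeric
mask = df['station'].str.isnumeric().fillna(False)
df_filtered = df[~mask]
print(df_filtered)
#   station
# 0   Alpha
# 2  Bravo9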

adding row from one dataframe to another

I am trying to insert or add a row from one dataframe to another dataframe. I am going through the original dataframe looking for certain words in one column. When I find one of these terms I want to add that row to a new dataframe.
I get the row by using:
entry = df.loc[df['A'] == item]
But when trying to add this row to another dataframe using .add, .insert, .update or other methods I just get an empty dataframe.
I have also tried adding the column to a dictionary and turning that into a dataframe, but it writes data for the entire row rather than just the column value. So is there a way to add one specific row to a new dataframe from my existing variable?
So the entry is a dataframe containing the rows you want to add?
You can simply concatenate two dataframes using the concat function if both have the same column names:
import pandas as pd
entry = df.loc[df['A'] == item]
concat_df = pd.concat([new_df,entry])
pandas.concat reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
The append function accepts another DataFrame, a Series or dict, or a list of these. So, assuming you're trying to add one row, you should use:
entry = df.loc[df['A'] == item]
df2 = df2.append(entry)
Notice that unlike Python's list.append, DataFrame.append returns a new object and does not change the object it is called on.
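Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current pandas the concat approach above is the one to use. A minimal sketch with made-up data (the column name 'A' follows the question):

import pandas as pd

df = pd.DataFrame({'A': ['apple', 'banana', 'cherry'], 'B': [1, 2, 3]})

item = 'banana'
entry = df.loc[df['A'] == item]  # the matching row(s)

# pd.concat replaces the removed DataFrame.append
new_df = pd.concat([entry], ignore_index=True)
new_df = pd.concat([new_df, df.loc[df['A'] == 'cherry']], ignore_index=True)
print(new_df)
#         A  B
# 0  banana  2
# 1  cherry  3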
Not sure how large your operations will be, but from an efficiency standpoint, you're better off appending all of the found rows to a list and then concatenating them together at once using pandas.concat, then using concat again to combine the found-entries dataframe with the "insert into" dataframe. This will be much faster than calling concat for each row. If you're searching from a list of items search_keys, then something like:
entries = []
for key in search_keys:
    entry = df.loc[df['A'] == key]  # match on each search key, not a fixed item
    entries.append(entry)
found_df = pd.concat(entries)
result_df = pd.concat([old_df, found_df])

Looping through a list in Python and creating new objects based on items

Let's say I have a list of objects (in this instance, dataframes)
myList = [dataframe1, dataframe2, dataframe3 ...]
I want to loop over my list and create new objects based on the names of the list items. What I want is a pivoted version of each dataframe, called "dataframe[X]_pivot" where [X] is the identifier for that dataframe.
My pseudocode looks something like:
for d in myList:
    d + '_pivot' = d.pivot_table(index='columnA', values=['columnB'], aggfunc=np.sum)
And my desired output looks like this:
myList = [dataframe1, dataframe2 ...]
dataframe1_pivoted # contains a pivoted version of dataframe1
dataframe2_pivoted # contains a pivoted version of dataframe2
dataframe3_pivoted # contains a pivoted version of dataframe3
Help would be much appreciated.
You do not want to do that. Creating variables dynamically is almost always a very bad idea. The correct thing to do is simply to use an appropriate data structure to hold your data, e.g. either a list (as your elements are all just numbered, you can just as well access them via an index) or a dictionary (if you really want to give a name to each individual thing):
pivoted_list = []
for df in mylist:
    # whatever you need to do to turn a dataframe into a pivoted one
    pivoted_df = df.pivot_table(index='columnA', values=['columnB'], aggfunc=np.sum)
    pivoted_list.append(pivoted_df)

# now access your results by index
do_something(pivoted_list[0])
do_something(pivoted_list[1])
The same thing can be expressed as a list comprehension. Assume pivot is a function that takes a dataframe and turns it into a pivoted frame, then this is equivalent to the loop above:
pivoted_list = [pivot(df) for df in mylist]
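For completeness, a minimal sketch of such a pivot helper, reusing the pivot_table call from the question (columnA and columnB are the asker's placeholder names):

import numpy as np

def pivot(df):
    # wrap the question's pivot_table call so it can be mapped over the list
    return df.pivot_table(index='columnA', values=['columnB'], aggfunc=np.sum)

pivoted_list = [pivot(df) for df in mylist]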
If you are certain that you want to have names for the elements, you can create a dictionary, by using enumerate like this:
pivoted_dict = {}
for index, df in enumerate(mylist):
    # same pivot_table call as above
    pivoted_df = df.pivot_table(index='columnA', values=['columnB'], aggfunc=np.sum)
    dfname = "dataframe{}_pivoted".format(index + 1)
    pivoted_dict[dfname] = pivoted_df

# access results by name
do_something(pivoted_dict["dataframe1_pivoted"])
do_something(pivoted_dict["dataframe2_pivoted"])
The way to achieve that is with globals(). Note that d is the dataframe itself, not its name, so the key has to be built from something else, such as an index.
[edit] After looking at your edit, I see that you may want to do something like this:
for i, d in enumerate(myList):
    globals()['dataframe%d_pivoted' % (i + 1)] = d.pivot_table(...)
However, as others have suggested, it is inadvisable to do so if it is going to create lots of global variables.
There are better ways (read: data structures) to do so.
