Finding conditioned consecutive values in a pandas DataFrame - python

I have a pandas dataframe with multiple rows and columns filled with types and values. All are strings. I want to write a function that conditions:
1) which type I search (column 1)
2) a first value (column 2)
3) a second, consecutive value (in the next row of column 2)
I managed to write a function that searches for one value of one type, as shown below, but how do I add the second value? I think it might work with the help of df.shift(axis=0), but I do not know how to combine that call with a conditional search.
import pandas as pd
d = {'type': ['wordclass', 'wordclass', 'wordclass', 'wordclass', 'wordclass', 'wordclass',
'english', 'english', 'english', 'english', 'english', 'english'],
'values': ['dem', 'noun', 'cop', 'det', 'dem', 'noun', 'this', 'tree', 'is', 'a', 'good', 'tree']}
df = pd.DataFrame(data=d)
print(df)
tiername = 'wordclass'
v1 = 'dem'
v2 = 'noun'
def search_single_tier(tiername, v1):
    searchoutput = df[df['type'].str.contains(tiername) & df['values'].str.match(v1)]
    return searchoutput

x = search_single_tier(tiername, v1)
print(x)

You don't need to create a function for doing this. Instead, try this:
In [422]: tiername = 'wordclass'
## This equates `type` columns to `tiername`.
## `.iloc[0:2]` gets the first 2 rows for the matched condition
In [423]: df[df.type.eq(tiername)].iloc[0:2]
Out[423]:
        type values
0  wordclass    dem
1  wordclass   noun
After OP's comment:
Find all consecutive rows like this:
tiername = 'wordclass'
v1 = 'dem'
In [455]: ix_list = df[df.type.eq(tiername) & df['values'].eq(v1)].index.tolist()
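## For each match, take the matched row and the row after it.
## This works for any number of matches; positions equal labels here because df has a default RangeIndex.
In [464]: pd.concat([df.iloc[i: i+2] for i in ix_list])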
Out[464]:
        type values
0  wordclass    dem
1  wordclass   noun
4  wordclass    dem
5  wordclass   noun
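For reference, the df.shift idea from the question can also be combined with boolean masks, so the number of matches does not have to be known in advance. The following is only a sketch: the function name search_consecutive and the parameter v2 are made up for illustration, and it uses exact matching with eq rather than str.contains or str.match.

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["wordclass"] * 6 + ["english"] * 6,
    "values": ["dem", "noun", "cop", "det", "dem", "noun",
               "this", "tree", "is", "a", "good", "tree"],
})


def search_consecutive(df, tiername, v1, v2):
    """Return rows where v1 is directly followed by v2 within tier `tiername`.

    Hypothetical helper, not part of pandas.
    """
    in_tier = df["type"].eq(tiername)
    is_first = in_tier & df["values"].eq(v1)
    is_second = in_tier & df["values"].eq(v2)
    # Shift the "second" mask up by one row so it lines up with the row before it
    next_is_second = is_second.shift(-1, fill_value=False)
    starts = is_first & next_is_second
    # Keep each starting row together with the row that follows it
    keep = starts | starts.shift(1, fill_value=False)
    return df[keep]


result = search_consecutive(df, "wordclass", "dem", "noun")
print(result)
```

With the sample data, this selects rows 0, 1, 4 and 5, the same rows as the concat approach above.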

Related

Python: how to identify common elements in lists from two dataframes' series

Using Pandas, I have two data sets stored in two separate dataframes. Each dataframe is composed of two series.
The first dataframe has a series called 'name', the second series is a list of strings. It looks something like this:
name attributes
0 John [ABC, DEF, GHI, JKL, MNO, PQR, STU]
1 Mike [EUD, DBS, QMD, ABC, GHI]
2 Jane [JKL, EJD, MDE, MNO, DEF, ABC]
3 Kevin [FHE, EUD, GHI, MNO, ABC, AUE, HSG, PEO]
4 Stefanie [STU, EJD, DUE]
The second dataframe is similar, with the first series called 'username' and the second called 'attr':
username attr
0 username_1 [DHD, EOA, AUE, CHE, ABC, PQR, QJF]
1 username_2 [ABC, EKR, ADT, GHI, JKL, EJD, MNO, MDE]
2 username_3 [DSB, AOD, DEF, MNO, DEF, ABC, TAE]
3 username_4 [DJH, EUD, GHI, MNO, ABC, FHE]
4 username_5 [CHQ, ELT, ABC, DEF, GHI]
What I'm trying to achieve is to compare the attributes (second series) of each dataframe to see which names and usernames share the most attributes.
For example, username_4 has 5 out of 6 attributes matching those of Kevin's.
I thought of looping one of the attributes series and see if there's a match in each row of the other series but couldn't loop effectively (maybe because my lists don't have quotation marks around the strings?).
I don't really know what possibilities exist to compare those two series and end up with a result as mentioned above (username_4 has 5 out of 6 attributes matching those of Kevin's).
What would be the possible approach(es) here?
You could try a method like below:
# Import pandas library
import pandas as pd
# Create our data frames
data1 = [['John', ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR', 'STU']], ['Mike', ['EUD', 'DBS', 'QMD', 'ABC', 'GHI']],
['Jane', ['JKL', 'EJD', 'MDE', 'MNO', 'DEF', 'ABC']], ['Kevin', ['FHE', 'EUD', 'GHI', 'MNO', 'ABC', 'AUE', 'HSG', 'PEO']],
['Stefanie', ['STU', 'EJD', 'DUE']]]
data2 = [['username_1', ['DHD', 'EOA', 'AUE', 'CHE', 'ABC', 'PQR', 'QJF']], ['username_2', ['ABC', 'EKR', 'ADT', 'GHI', 'JKL', 'EJD', 'MNO', 'MDE']],
['username_3', ['DSB', 'AOD', 'DEF', 'MNO', 'DEF', 'ABC', 'TAE']], ['username_4', ['DJH', 'EUD', 'GHI', 'MNO', 'ABC', 'FHE']],
['username_5', ['CHQ', 'ELT', 'ABC', 'DEF', 'GHI']]]
# Create the pandas DataFrames with column names provided explicitly
df1 = pd.DataFrame(data1, columns=['name', 'attributes'])
df2 = pd.DataFrame(data2, columns=['username', 'attr'])
# Create helper function to compare our two data frames
def func(inputDataFrame2, inputDataFrame1):
    outputDictionary = {} # Set a dictionary for our output
    for i, r in inputDataFrame2.iterrows(): # Loop over items in second data frame
        dictBuilder = {}
        for index, row in inputDataFrame1.iterrows(): # Loop over items in first data frame
            name = row['name']
            dictBuilder[name] = len([w for w in r['attr'] if w in row['attributes']]) # Get count of items in both lists
        maxKey = max(dictBuilder, key=dictBuilder.get) # Get the max value from the list of repeated items
        outputDictionary[r['username']] = [maxKey, dictBuilder[maxKey]] # Add name and count of attribute matches to dictionary
    print(outputDictionary) # Debug print statement
    return outputDictionary # Return our output dictionary here for further processing
a = func(df2, df1)
That should yield an output like below:
{'username_1': ['John', 2], 'username_2': ['Jane', 5], 'username_3': ['John', 4], 'username_4': ['Kevin', 5], 'username_5': ['John', 3]}
Each key in the returned dictionary is a username from the second data frame, and each value is a list containing the name with the most matching attributes in the first data frame together with the match count.
Note that this method could be optimized in how it loops over each row in the two data frames - The thread below describes a few different ways to process rows in data frames:
How to iterate over rows in a DataFrame in Pandas
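As one possible optimization, the counting could be done with set intersections instead of nested iterrows loops. This is a sketch under assumptions: the function name best_matches is invented here, and because sets drop duplicates, a repeated attribute (such as DEF for username_3) is counted once, so that count differs from the list-based version above.

```python
import pandas as pd

df1 = pd.DataFrame({
    "name": ["John", "Mike", "Jane", "Kevin", "Stefanie"],
    "attributes": [
        ["ABC", "DEF", "GHI", "JKL", "MNO", "PQR", "STU"],
        ["EUD", "DBS", "QMD", "ABC", "GHI"],
        ["JKL", "EJD", "MDE", "MNO", "DEF", "ABC"],
        ["FHE", "EUD", "GHI", "MNO", "ABC", "AUE", "HSG", "PEO"],
        ["STU", "EJD", "DUE"],
    ],
})
df2 = pd.DataFrame({
    "username": ["username_1", "username_2", "username_3", "username_4", "username_5"],
    "attr": [
        ["DHD", "EOA", "AUE", "CHE", "ABC", "PQR", "QJF"],
        ["ABC", "EKR", "ADT", "GHI", "JKL", "EJD", "MNO", "MDE"],
        ["DSB", "AOD", "DEF", "MNO", "DEF", "ABC", "TAE"],
        ["DJH", "EUD", "GHI", "MNO", "ABC", "FHE"],
        ["CHQ", "ELT", "ABC", "DEF", "GHI"],
    ],
})


def best_matches(users, people):
    """Map each username to [best matching name, number of shared attributes].

    Hypothetical helper; ties go to the name that appears first in `people`.
    """
    # Convert each attribute list to a set once, up front
    people_sets = {n: set(a) for n, a in zip(people["name"], people["attributes"])}
    result = {}
    for username, attrs in zip(users["username"], users["attr"]):
        attrs = set(attrs)
        counts = {n: len(attrs & s) for n, s in people_sets.items()}
        best = max(counts, key=counts.get)
        result[username] = [best, counts[best]]
    return result


matches = best_matches(df2, df1)
print(matches)
```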

Pandas: Create a new column with column name and cell of matching string

I am searching through a large spreadsheet with 300 columns and over 200k rows. I would like to create a column that holds the column header and the matching cell value, something that looks like "Column||Value". I have the search terms and the join separator. I can get the row index name, but I'm struggling to get the matching column and the specific cell. Here's my code so far:
df = pd.read_excel (r"Test_file")
mask = df.astype(str).applymap(lambda x: any(y in x for y in ['Chann','Midm'])).any(1)
df['extract'] = df.loc[mask] #This only give me the index name. I would like the actual matched cell contents.
df['extract2'] = Column name
df['Match'] = df[['extract', 'extract2']].agg('||'.join.axis=1)
df.drop(['extract', 'extract2'], axis=1)
Final output should look something like
Output
You can create a mask for a specific column first (I edited your 2nd line a bit), then create a new 'Match' column with all values initialized to 'No Match', and finally, change the values to your desired format ("Column||Value") for rows that are returned after applying the mask. I implemented this in the following sample code:
def match_column(df, column_name):
    column_mask = df.astype(str).applymap(lambda x: any(y in x for y in ['Chann','Midm']))[column_name]
    df['Match'] = 'No Match'
    df.loc[column_mask, 'Match'] = column_name + ' || ' + df[column_name]
    return df
df = {
'Segment': ['Government', 'Government', 'Midmarket', 'Midmarket', 'Government', 'Channel Partners'],
'Country': ['Canada', 'Germany', 'France', 'Canada', 'France', 'France']
}
df = pd.DataFrame(df)
display(df)
df = match_column(df, 'Segment')
display(df)
Output:
However, this only works for a single column. I don't know what output you want for cases when there are matches in multiple columns (if you can, please specify).
UPDATE:
If you want to use a list of columns as input and match with the first instance, you can use this instead:
def match_first_column(df, column_list):
    df['Match'] = 'No Match'
    # iterate over rows
    for index, row in df.iterrows():
        # iterate over column names
        for column_name in column_list:
            column_value = row[column_name]
            substrings = ['Chann', 'Midm', 'Fran']
            # if a match is found
            if any(x in column_value for x in substrings):
                # add match string
                df.loc[index, 'Match'] = column_name + ' || ' + column_value
                # stop iterating and move to next row
                break
    return df
df = {
'Segment': ['Government', 'Government', 'Midmarket', 'Midmarket', 'Government', 'Channel Partners'],
'Country': ['Canada', 'Germany', 'France', 'Canada', 'France', 'France']
}
df = pd.DataFrame(df)
display(df)
column_list= df.columns.tolist()
match_first_column(df, column_list)
Output:
You can try:
mask = df.astype(str).applymap(lambda x: any(y in x for y in ['Chann','Midm'])).any(axis=1)
df.loc[mask, 'Match'] = '||'.join(df[['extract', 'extract2']])
df['Match'].fillna('No Match', inplace=True)
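To illustrate how both the matching column name and the matching cell value can be recovered without a Python-level loop over columns, here is a hedged sketch. It builds a boolean table of hits and takes the first hit per row. The names substrings, pattern, and hits are illustrative, and the sample data is the one from the answer above rather than the original spreadsheet.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "Segment": ["Government", "Government", "Midmarket", "Midmarket",
                "Government", "Channel Partners"],
    "Country": ["Canada", "Germany", "France", "Canada", "France", "France"],
})

substrings = ["Chann", "Midm"]
# Escape the search terms so they are matched literally
pattern = "|".join(re.escape(s) for s in substrings)

# Boolean table: True where a cell contains any of the substrings
hits = df.astype(str).apply(lambda col: col.str.contains(pattern))
hit_array = hits.to_numpy(dtype=bool)

has_hit = hit_array.any(axis=1)        # does the row contain a match at all?
first_pos = hit_array.argmax(axis=1)   # position of the first matching column

# Build "Column||Value" for rows with a hit, "No Match" otherwise
df["Match"] = [
    f"{df.columns[pos]}||{df.iat[row, pos]}" if ok else "No Match"
    for row, (pos, ok) in enumerate(zip(first_pos, has_hit))
]
print(df)
```

As in match_first_column, only the first matching column per row is reported; rows with several matching columns would need a different output format.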

Add a new column containing the difference between EACH TWO ROWS of another column of a data frame

I would like to get the difference between every two rows of the column Duration and then store the values in a new column difference, or print them.
So basically I want: row(1)-row(2)=difference1, row(3)-row(4)=difference2, row(5)-row(6)=difference3 ....
Example of a code:
data = {'Profession':['Teacher', 'Banker', 'Teacher', 'Judge','lawyer','Teacher'], 'Gender':['Male','Male', 'Female', 'Male','Male','Female'],'Size':['M','M','L','S','S','M'],'Duration':['5','6','2','3','4','7']}
data2={'Profession':['Doctor', 'Scientist', 'Scientist', 'Banker','Judge','Scientist'], 'Gender':['Male','Male', 'Female','Female','Male','Male'],'Size':['L','M','L','M','L','L'],'Duration':['1','2','9','10','1','17']}
data3 = {'Profession':['Banker', 'Banker', 'Doctor', 'Doctor','lawyer','Teacher'], 'Gender':['Male','Male', 'Female', 'Female','Female','Male'],'Size':['S','M','S','M','L','S'],'Duration':['15','8','5','2','11','10']}
data4={'Profession':['Judge', 'Judge', 'Scientist', 'Banker','Judge','Scientist'], 'Gender':['Female','Female', 'Female','Female','Female','Female'],'Size':['M','S','L','S','M','S'],'Duration':['1','2','9','10','1','17']}
df= pd.DataFrame(data)
df2=pd.DataFrame(data2)
df3=pd.DataFrame(data3)
df4=pd.DataFrame(data4)
DATA=pd.concat([df,df2,df3,df4])
DATA.groupby(['Profession','Size','Gender']).agg('sum')
D=DATA.reset_index()
D['difference']=D['Duration'].diff(-1)
I tried using diff(-1), but it's not exactly what I'm looking for. Any ideas?
Is that what you wanted?
D["Neighbour"]=D["Duration"].shift(-1)
# fill empty lines with 0
D["Neighbour"] = D["Neighbour"].fillna(0)
# convert columns "Neighbour" and "Duration" to numeric
D["Neighbour"] = pd.to_numeric(D["Neighbour"])
D["Duration"] = pd.to_numeric(D["Duration"])
# get difference
D["difference"]=D["Duration"] - D["Neighbour"]
# remove "Neighbour" column
D = D.drop(columns=["Neighbour"])
# blank out the difference on every second row (index 1, 3, 5, ...)
D.loc[1::2,"difference"] = None
# print D
D
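A shorter variant of the same idea, shown here as a sketch on a small stand-in frame rather than the full concatenated DATA: diff(-1) already computes each row minus the next one, so keeping only every second result gives row(1)-row(2), row(3)-row(4), and so on. It assumes a default 0..n-1 index, as produced by reset_index.

```python
import pandas as pd

# Small stand-in for the Duration column (strings, as in the question)
D = pd.DataFrame({"Duration": ["5", "6", "2", "3", "4", "7"]})

# Convert to numbers first; diff does not work on strings
duration = pd.to_numeric(D["Duration"])

# diff(-1) gives row minus next row; keep only positions 0, 2, 4, ...
# Assignment aligns on the index, so the skipped rows become NaN
D["difference"] = duration.diff(-1).iloc[::2]
print(D)
```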

how to create new column on the basis of word matched [duplicate]

How do I add a new column based on a searched item? For example, if a dataframe column contains BX, the new column should contain BOX. There are more than 30 such short forms,
so I think a dictionary would be the best option for the replacement.
mapping= {
'BX': 'BOX',
'CS': 'CASE',
'EA': 'EACH',
'PK': 'PACK',
'None': None
}
import pandas as pd
lst = ['BX', 'EA', 'EA', 'PK', 'BG','CS']
df = pd.DataFrame(lst)
df.map(mapping)
Somehow I am not able to do it.
You can do this as follows.
# first define a mapping
mapping= {
'BX': 'BOX',
'CS': 'CASE',
'EA': 'EACH',
'PK': 'PACK',
'None': None
}
# then apply it with map (assuming your abbreviations are
# stored in column short and the result should be stored
# in long)
df['long']=df['short'].map(mapping)
With the following test dataframe
lst = ['BX', 'EA', 'EA', 'PK', 'BG','CS']
df = pd.DataFrame(dict(short=lst))
df['long'] = df['short'].map(mapping)
df
It outputs:
Out[447]:
  short  long
0    BX   BOX
1    EA  EACH
2    EA  EACH
3    PK  PACK
4    BG   NaN
5    CS  CASE
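If abbreviations that are missing from the dictionary (such as BG above) should keep their original value instead of becoming NaN, one option is to fill the gaps from the source column. This is a small sketch using the same column names short and long as assumed above.

```python
import pandas as pd

mapping = {
    "BX": "BOX",
    "CS": "CASE",
    "EA": "EACH",
    "PK": "PACK",
}

df = pd.DataFrame({"short": ["BX", "EA", "EA", "PK", "BG", "CS"]})

# map() yields NaN for keys not in the dictionary;
# fillna() then falls back to the original abbreviation
df["long"] = df["short"].map(mapping).fillna(df["short"])
print(df)
```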

Compare two dataframe columns for matching percentage

I want to compare a data frame of one column with another data frame of multiple columns and return the header of the column having maximum match percentage.
I am not able to find any match functions in pandas. First data frame first column :
cars
----
swift
maruti
wagonor
hyundai
jeep
First data frame second column :
bikes
-----
RE
Ninja
Bajaj
pulsar
one column data frame :
words
---------
swift
RE
maruti
waganor
hyundai
jeep
bajaj
Desired output :
100% match header - cars
Try the isin method of the pandas DataFrame. Assuming df is your first dataframe and words is a list:
In[1]: (df.isin(words).sum()/df.shape[0])*100
Out[1]:
cars 100.0
bikes 20.0
dtype: float64
You may need to lowercase strings in your df and in the words list to avoid any casing issue.
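The lowercasing step could look like the following sketch. The frame and word list are rebuilt from the question's example (using the spelling waganor from the words list for both), and the variable names words_lower, lowered, and match_pct are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "cars": ["swift", "maruti", "waganor", "hyundai", "jeep"],
    "bikes": ["RE", "Ninja", "Bajaj", "pulsar", None],
})
words = ["swift", "RE", "maruti", "waganor", "hyundai", "jeep", "bajaj"]

# Normalize case on both sides before comparing
words_lower = [w.lower() for w in words]
lowered = df.apply(lambda col: col.str.lower())

# Percentage of rows in each column that appear in the word list
match_pct = lowered.isin(words_lower).sum() / df.shape[0] * 100
print(match_pct)
print("best match header:", match_pct.idxmax())
```

With case ignored, Bajaj now matches bajaj, so bikes rises from 20% to 40% while cars stays at 100%.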
You can first get the columns into lists:
dfCarsList = df['cars'].tolist()
dfWordsList = df['words'].tolist()
dfBikesList = df['bikes'].tolist()
And then iterate over the lists for comparison:
numberCars = sum(any(m in L for m in dfCarsList) for L in dfWordsList)
numberBikes = sum(any(m in L for m in dfBikesList) for L in dfWordsList)
You can then use the higher number for your output.
Construct a Series using numpy.in1d and ndarray.mean then call the Series.idxmax and max methods:
# Setup
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'cars': {0: 'swift', 1: 'maruti', 2: 'waganor', 3: 'hyundai', 4: 'jeep'}, 'bikes': {0: 'RE', 1: 'Ninja', 2: 'Bajaj', 3: 'pulsar', 4: np.nan}})
df2 = pd.DataFrame({'words': {0: 'swift', 1: 'RE', 2: 'maruti', 3: 'waganor', 4: 'hyundai', 5: 'jeep', 6: 'bajaj'}})
match_rates = pd.Series({col: np.in1d(df1[col], df2['words']).mean() for col in df1})
print('{:.0%} match header - {}'.format(match_rates.max(), match_rates.idxmax()))
[out]
100% match header - cars
Here is a solution with a function that returns a tuple (column_name, match_percentage) for the column with the maximum match percentage. It accepts a pandas dataframe (bikes and cars in your example) and a series (words) as arguments.
def match(df, se):
    max_matches = 0
    max_col = None
    for col in df.columns:
        # Get the number of matches in a column
        n_matches = sum([1 for row in df[col] if row in se.unique()])
        if n_matches > max_matches:
            max_col = col
            max_matches = n_matches
    return max_col, max_matches/df.shape[0]
With your example, you should get the following output.
df = pd.DataFrame()
df['Cars'] = ['swift', 'maruti', 'wagonor', 'hyundai', 'jeep']
df['Bikes'] = ['RE', 'Ninja', 'Bajaj', 'pulsar', '']
se = pd.Series(['swift', 'RE', 'maruti', 'wagonor', 'hyundai', 'jeep', 'bajaj'])
In [1]: match(df, se)
Out[1]: ('Cars', 1.0)
