Python: unable to compare strings in dataframes

I am trying to look up string values across two dataframes using the pandas library.
The first dataframe, df_transactions, has error codes in the column 'ErrList'.
The second dataframe, df_action, has error codes in the column 'Code' and the corresponding action in the column 'Action'.
I am trying to compare the strings from these dataframes as below:
ActionLookup_COL = []
ActionLookup = []
for index, transactions in df_transactions.iterrows():
    errorList = transactions['ErrList']
    for index, errorCode in df_action.iterrows():
        eCode = errorCode['Code']
        eAction = errorCode['Action']
        if eCode == errorList:
            ActionLookup.append(eAction)
    ActionLookup_COL.append(ActionLookup)
df_results['ActionLookup'] = pd.Series(ActionLookup_COL, index=df_results.index)
When I print df_results['ActionLookup'], I do not get the action corresponding to each error code. How can I compare the strings in these dataframes?
Thanks for your time!

IIUC you need merge:
pd.merge(df_transactions, df_action, left_on='ErrList', right_on='Code')
Sample:
df_transactions = pd.DataFrame({'ErrList':['a','af','e','d'],
                                'col':[4,5,6,8]})
print (df_transactions)
  ErrList  col
0       a    4
1      af    5
2       e    6
3       d    8
df_action = pd.DataFrame({'Code':['a','af','u','m'],
                          'Action':[1,2,3,4]})
print (df_action)
   Action Code
0       1    a
1       2   af
2       3    u
3       4    m
df_results = pd.merge(df_transactions, df_action, left_on='ErrList', right_on='Code')
print (df_results)
  ErrList  col  Action Code
0       a    4       1    a
1      af    5       2   af
print (df_results['Action'])
0    1
1    2
Name: Action, dtype: int64
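Note that the default inner merge drops transactions whose code has no match ('e' and 'd' in the sample). If every transaction should be kept, a left merge, or Series.map on a code-to-action lookup, may be closer to what the question asks for. This is a minimal sketch reusing the sample data; the column name ActionLookup is taken from the question, and the map variant assumes the codes in df_action are unique.

```python
import pandas as pd

df_transactions = pd.DataFrame({'ErrList': ['a', 'af', 'e', 'd'],
                                'col': [4, 5, 6, 8]})
df_action = pd.DataFrame({'Code': ['a', 'af', 'u', 'm'],
                          'Action': [1, 2, 3, 4]})

# Left merge: keeps every transaction; Action is NaN where no code matches
df_left = df_transactions.merge(df_action, how='left',
                                left_on='ErrList', right_on='Code')

# Alternative: build a Code -> Action lookup and map it onto ErrList
# (assumption: the codes in df_action are unique)
lookup = df_action.set_index('Code')['Action']
df_results = df_transactions.copy()
df_results['ActionLookup'] = df_results['ErrList'].map(lookup)
print(df_results)
```

If codes that look identical still fail to match on real data, stray whitespace or differing case in one of the columns is a common cause, and normalising both columns with .str.strip() before the lookup may help.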

Related

pandas combine multiple row into one, and update other columns [duplicate]

I have this dataframe and I need to drop all duplicates but I need to keep first AND last values
For example:
1 0
2 0
3 0
4 0
output:
1 0
4 0
I tried df.column.drop_duplicates(keep=("first","last")) but it doesn't work; it raises
ValueError: keep must be either "first", "last" or False
Does anyone know a workaround for this?
Thanks

You could use pandas' concat function to create a dataframe with both the first and last values.
pd.concat([
    df['X'].drop_duplicates(keep='first'),
    df['X'].drop_duplicates(keep='last'),
])

You can't keep both first and last in a single drop_duplicates call, so the trick is to concat the dataframes of first and last values.
When you concat, you have to avoid duplicating rows that were never duplicates, so only concat the index labels from the second dataframe that are not already in the first. (Not sure if merge/join would work better?)
import pandas as pd
d = {1:0,2:0,10:1, 3:0,4:0}
df = pd.DataFrame.from_dict(d, orient='index', columns=['cnt'])
print(df)
cnt
1 0
2 0
10 1
3 0
4 0
Then do this:
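d1 = df.drop_duplicates(keep="first")
d2 = df.drop_duplicates(keep="last")
# Index.difference gives the labels in d2 that are not in d1; recent pandas rejects a set as a .loc indexer
d3 = pd.concat([d1, d2.loc[d2.index.difference(d1.index)]])
d3
Out[60]:
    cnt
1     0
10    1
4     0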
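A single boolean mask is another option, offered here as a sketch rather than as part of the answer above: a row sits in the middle of a run of duplicates exactly when it is neither the first nor the last occurrence, which DataFrame.duplicated can express directly. The data reuses the cnt example.

```python
import pandas as pd

d = {1: 0, 2: 0, 10: 1, 3: 0, 4: 0}
df = pd.DataFrame.from_dict(d, orient='index', columns=['cnt'])

# duplicated(keep='first') is True for every occurrence except the first;
# duplicated(keep='last') is True for every occurrence except the last.
# A row flagged by both is neither first nor last, so it is dropped.
middle = df.duplicated(keep='first') & df.duplicated(keep='last')
result = df[~middle]
print(result)
```

This keeps the original row order, and values that occur only once are kept exactly once without any index bookkeeping.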

Use a groupby on your column named column, then reindex. If you ever want to check for duplicate values in more than one column, you can extend the columns you include in your groupby.
df = pd.DataFrame({'column':[0,0,0,0]})
Input:
column
0 0
1 0
2 0
3 0
df.groupby('column', as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[0, -1]]).reset_index(level=0, drop=True)
Output:
column
0 0
3 0

Iterating Conditions through Pandas .loc

I just wanted to ask the community and see if there is a more efficient way to do this.
I have several columns in a data frame and I am using .loc to filter on values in column A so I can perform calculations on column B.
I can easily do something like...
filter_1 = df.loc[df['Condition'] == 1]
And then perform the mathematical calculation on column B that I need.
But there are many conditions I must go through, so I was wondering if I could make a list of the conditions and then iterate them through the .loc function in fewer lines of code?
Would something like this work, where I create a list and then iterate the conditions through a loop?
Thank you!
This example gets most of what I want. I just need it to show 6.4 and 7.0 in this example. How can I change the iteration so that it shows the results for the unique values in column 'a'?
import pandas as pd
a = [1,2,1,2,1,2,1,2,1,2]
b = [5,1,3,5,7,20,9,5,8,4]
col = ['a', 'b']
list_1 = []
for i, j in zip(a,b):
    list_1.append([i,j])
df1 = pd.DataFrame(list_1, columns= col)
for i in a:
    aa = df1[df1['a'].isin([i])]
    aa1 = aa['b'].mean()
    print (aa1)

Solution using set
set_a = set(a)
for i in set_a:
    aa = df1[df1['a'].isin([i])]
    aa1 = aa['b'].mean()
    print (aa1)
Solution using pandas mean function
Is this what you are looking for?
import pandas as pd
a = [1,2,1,2,1,2,1,2,1,2]
b = [5,1,3,5,7,20,9,5,8,4]
df = pd.DataFrame({'a':a,'b':b})
print (df)
print(df.groupby('a').mean())
The results from this are:
Original Dataframe df:
   a   b
0  1   5
1  2   1
2  1   3
3  2   5
4  1   7
5  2  20
6  1   9
7  2   5
8  1   8
9  2   4
The mean of 'b' for each value of 'a' is:
     b
a
1  6.4
2  7.0

Here you go:
df = df[(df['A'] > 1) & (df['A'] < 10)]
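To iterate over several conditions, as the question asks, one option is to store each condition as a boolean mask in a dict and loop over it. This is a minimal sketch reusing the a/b data from the question; the condition labels and thresholds are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
                   'b': [5, 1, 3, 5, 7, 20, 9, 5, 8, 4]})

# Hypothetical conditions: each value is a boolean Series aligned with df
conditions = {
    'a == 1': df['a'] == 1,
    'a == 2': df['a'] == 2,
    'a == 2 and b > 4': (df['a'] == 2) & (df['b'] > 4),
}

# Apply each mask through .loc and compute the mean of column b
means = {label: df.loc[mask, 'b'].mean() for label, mask in conditions.items()}
for label, value in means.items():
    print(label, value)
```

When the conditions are simply "one per distinct value of a column", groupby is usually shorter; the dict of masks is more useful when the conditions overlap or combine several columns.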

Adding values in a cell in Pandas

I am trying to add values in cells of one column in Pandas Dataframe. The dataframe was created:
data = [['ID_123456', 'example=1(abc)'], ['ID_123457', 'example=1(def)'], ['ID_123458', 'example=1(try)'], ['ID_123459', 'example=1(try)'], ['ID_123460', 'example=1(try),2(test)'], ['ID_123461', 'example=1(try),2(test),9(yum)'], ['ID_123462', 'example=1(try)'], ['ID_123463', 'example=1(try),7(test)']]
df = pd.DataFrame(data, columns = ['ID', 'occ'])
display(df)
The table looks like this:
ID         occ
ID_123456  example=1(abc)
ID_123457  example=1(def)
ID_123458  example=1(try)
ID_123459  example=1(try)
ID_123460  example=1(try),2(test)
ID_123461  example=1(try),2(test),9(yum)
ID_123462  example=1(try)
ID_123463  example=1(try),7(test)
The following link is related to it but I was unable to run the command on my dataframe.
Sum all integers in a PANDAS DataFrame "cell"
The command gives an error of "string index out of range".
The output should look like this:
ID         occ                            count
ID_123456  example=1(abc)                 1
ID_123457  example=1(def)                 1
ID_123458  example=1(try)                 1
ID_123459  example=1(try)                 1
ID_123460  example=1(try),2(test)         3
ID_123461  example=1(try),2(test),9(yum)  12
ID_123462  example=1(try)                 1
ID_123463  example=1(try),7(test)         8

If you want to sum all the numbers in column occ, use Series.str.extractall, convert to integers, and sum per original row by grouping on the first index level:
df['count'] = df['occ'].str.extractall(r'(\d+)')[0].astype(int).groupby(level=0).sum()
print (df)
          ID                            occ  count
0  ID_123456                 example=1(abc)      1
1  ID_123457                 example=1(def)      1
2  ID_123458                 example=1(try)      1
3  ID_123459                 example=1(try)      1
4  ID_123460         example=1(try),2(test)      3
5  ID_123461  example=1(try),2(test),9(yum)     12
6  ID_123462                 example=1(try)      1
7  ID_123463         example=1(try),7(test)      8
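To see why the sum has to be taken per original row, it can help to look at what extractall returns before summing. The sketch below uses two rows from the question's data; the intermediate name matches is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['ID_123460', 'ID_123461'],
                   'occ': ['example=1(try),2(test)',
                           'example=1(try),2(test),9(yum)']})

# extractall returns one row per regex match, indexed by
# (original row label, match number)
matches = df['occ'].str.extractall(r'(\d+)')[0].astype(int)
print(matches)

# Grouping on index level 0 sums the matches belonging to each original row
df['count'] = matches.groupby(level=0).sum()
print(df)
```

A row whose occ value contains no digits produces no matches at all, so it would end up as NaN in count; chaining .fillna(0) is one way to handle that if such rows can occur.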

Pandas DataFrames: Extract Information and Collapse Columns

I have a pandas DataFrame which contains information in columns which I would like to extract into a new column.
It is best explained visually:
df = pd.DataFrame({'Number Type 1':[1,2,np.nan],
                   'Number Type 2':[np.nan,3,4],
                   'Info':list('abc')})
The table shows the initial DataFrame with Number Type 1 and Number Type 2 columns.
I would like to extract the types and create a new Type column, refactoring the DataFrame accordingly.
Basically, the numbers are collapsed into a single Number column, and the types are extracted into the Type column. The information in the Info column stays bound to the numbers (e.g. 2 and 3 have the same information, b).
What is the best way to do this in Pandas?

Use melt with dropna:
df = df.melt('Info', value_name='Number', var_name='Type').dropna(subset=['Number'])
df['Type'] = df['Type'].str.extract(r'(\d+)', expand=False)
df['Number'] = df['Number'].astype(int)
print (df)
  Info Type  Number
0    a    1       1
1    b    1       2
4    b    2       3
5    c    2       4
Another solution with set_index and stack, starting again from the original df:
df = df.set_index('Info').stack().rename_axis(('Info','Type')).reset_index(name='Number')
df['Type'] = df['Type'].str.extract(r'(\d+)', expand=False)
df['Number'] = df['Number'].astype(int)
print (df)
  Info Type  Number
0    a    1       1
1    b    1       2
2    b    2       3
3    c    2       4

Python: Efficiently extract a single value for every group

I need to add a description column to a dataframe that is built by grouping items from another dataframe.
grouped= df1.groupby('item')
list= grouped['total'].agg(np.sum)
list= list.reset_index()
To assign a description label to every item I've come up with this solution:
def des(item):
    return df1['description'].loc[df1['item']== item].iloc[0]
list['description'] = list['item'].apply(des)
It works, but it takes an enormous amount of time to execute.
I'd like to do something like this:
list=list.assign(description= df1['description'].loc[df1['item']==list['item']])
or
list=list.assign(description= df1['description'].loc[df1['item'].isin(list['item'])])
These are very wrong, but I hope you get the idea. I'm hoping there is some pandas feature that does the trick more efficiently, but I can't find it.
Any ideas?

I think you need DataFrameGroupBy.agg with a dict of functions: sum for column total and first for description:
df = df1.groupby('item', as_index=False).agg({'total':'sum', 'description':'first'})
Also, don't use the variable name list, because it shadows the Python built-in list.
Sample:
df1 = pd.DataFrame({'description':list('abcdef'),
                    'B':[4,5,4,5,5,4],
                    'total':[5,3,6,9,2,4],
                    'item':list('aaabbb')})
print (df1)
   B description item  total
0  4           a    a      5
1  5           b    a      3
2  4           c    a      6
3  5           d    b      9
4  5           e    b      2
5  4           f    b      4
df = df1.groupby('item', as_index=False).agg({'total':'sum', 'description':'first'})
print (df)
  item  total description
0    a     14           a
1    b     15           d
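Named aggregation, available since pandas 0.25, is another way to write the same grouping and makes the output column names explicit. This is a sketch on the sample data above; the result name summary is chosen only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'description': list('abcdef'),
                    'B': [4, 5, 4, 5, 5, 4],
                    'total': [5, 3, 6, 9, 2, 4],
                    'item': list('aaabbb')})

# Each keyword names an output column; the tuple is (input column, function)
summary = df1.groupby('item', as_index=False).agg(
    total=('total', 'sum'),
    description=('description', 'first'),
)
print(summary)
```

Both spellings run one grouped pass over df1 instead of one filtered lookup per item, which is where the speed-up over the apply-based version in the question comes from.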
