pandas.DataFrame: aggregate rows based on regex - python

What I want to do
I have a trouble to clean my data because some values were not input correctly.
import pandas as pd
data = [[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]]
index = ['100: Test', '100: test', '101: FOO', '102: WWW', '101: foo foo']
columns = ['column1', 'column2']
df = pd.DataFrame(data, index=index, columns=columns)
print(df)
## Current output!!!!
# column1 column2
#100: Test 1 2
#100: test 2 4
#101: FOO 3 6
#102: WWW 4 8
#101: foo foo 5 10
## DO SOMETHING!!!!
print(df)
## Expected output!!!!
# column1 column2
#100: Test 2 4
#101: FOO 8 16
#102: WWW 4 8
My DataFrame.index consists of "ID" + "Name". However, names are not correct, so one ID may show up in more than one row.
Two requests
Sum up rows with the same ID.
Choose one name for the result. (For example, I can use either "Test" or "test" for ID=100.)
What I tried
I tried to use groupby function, but it doesn't seem to have regex compatibility.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
df2 = df.groupby(level=0).sum()
print(df2)
## Output
# column1 column2
#100: Test 1 2
#100: test 2 4
#101: FOO 3 6
#101: foo foo 5 10
#102: WWW 4 8
Environment
Python 3.10.5
Pandas 1.4.3

Your expected output for Test does not reflect that you are trying to do a summation, but from what I can gather this is what you want. groupby can take a function or a mapping or even a series as the by argument. Here, you just want the lowercase version of the index:
df.groupby(df.index.str.lower()).sum()
which gives
column1 column2
100: test 3 6
101: foo 8 16
102: www 4 8
Here, what I've done is passed it the lowercase index, and it simply groups the rows based on matching elements in the series.
Edit
Based on the updated question, to match the numbers, you can use regular expressions:
df.groupby(df.index.str.extract(r"(\d+):", expand=False)).sum()
which gives
column1 column2
100 3 6
101 8 16
102 4 8
It isn't clear what would take precedence 101: foo foo or 101: FOO, it seems like the numbers here are the important part regardless.

import numpy as np
import pandas as pd
# Data Import
data = [[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]]
index = ['100: Test', '100: test', '101: FOO', '102: WWW', '101: foo']
columns = ['column1', 'column2']
df = pd.DataFrame(data, index=index, columns=columns)
# Data Pre-process
df.reset_index(inplace=True)
df.rename(columns={'index':'ID_Name'},inplace=True)
df['ID'] = df['ID_Name'].str.split(':').str[0]
df.sort_values(['ID','ID_Name'],inplace=True)
df_group = df.groupby(['ID'])[['column1','column2']].sum().reset_index()
df_group
df = pd.merge(df,df_group,how='left',left_on='ID',right_on='ID')
df_final = df.groupby(['ID']).first()
# Data Clean Process
df_final.rename(columns={'column1_y':'column1','column2_y':'column2'},inplace= True)
df_final.drop(['column1_x','column2_x'],axis = 1 , inplace=True)
# Output Display
df_final
Hi Dmjy,
I have attached the code for you, please try from your side,
and if you still have any question please let me know
Thanks
Leon

Related

pandas combine nested dataframes into one single dataframe

I have a dataframe, where in one column (we'll call it info) all the cells/rows contain another dataframe inside. I want to loop through all the rows in this column and literally stack the nested dataframes on top of each other, because they all have the same columns
How would I go about this?
You could try as follows:
import pandas as pd
length=5
# some dfs
nested_dfs = [pd.DataFrame({'a': [*range(length)],
'b': [*range(length)]}) for x in range(length)]
print(nested_dfs[0])
a b
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
# df with nested_dfs in info
df = pd.DataFrame({'info_col': nested_dfs})
# code to be implemented
lst_dfs = df['info_col'].values.tolist()
df_final = pd.concat(lst_dfs,axis=0, ignore_index=True)
df_final.tail()
a b
20 0 0
21 1 1
22 2 2
23 3 3
24 4 4
This method should be a bit faster than the solution offered by nandoquintana, which also works.
Incidentally, it is ill advised to name a df column info. This is because df.info is actually a function. E.g., normally df['col_name'].values.tolist() can also be written as df.col_name.values.tolist(). However, if you try this with df.info.values.tolist(), you will run into an error:
AttributeError: 'function' object has no attribute 'values'
You also run the risk of overwriting the function if you start assigning values to your column on top of doing something which you probably don't want to do. E.g.:
print(type(df.info))
<class 'method'>
df.info=1
# column is unaffected, you just create an int variable
print(type(df.info))
<class 'int'>
# but:
df['info']=1
# your column now has all 1's
print(type(df['info']))
<class 'pandas.core.series.Series'>
This is the solution that I came up with, although it's not the fastest which is why I am still leaving the question unanswered
df1 = pd.DataFrame()
for frame in df['Info'].tolist():
df1 = pd.concat([df1, frame], axis=0).reset_index(drop=True)
Our dataframe has three columns (col1, col2 and info).
In info, each row has a nested df as value.
import pandas as pd
nested_d1 = {'coln1': [11, 12], 'coln2': [13, 14]}
nested_df1 = pd.DataFrame(data=nested_d1)
nested_d2 = {'coln1': [15, 16], 'coln2': [17, 18]}
nested_df2 = pd.DataFrame(data=nested_d2)
d = {'col1': [1, 2], 'col2': [3, 4], 'info': [nested_df1, nested_df2]}
df = pd.DataFrame(data=d)
We could combine all nested dfs rows appending them to a list (as nested dfs schema is constant) and concatenating them later.
nested_dfs = []
for index, row in df.iterrows():
nested_dfs.append(row['info'])
result = pd.concat(nested_dfs, sort=False).reset_index(drop=True)
print(result)
This would be the result:
coln1 coln2
0 11 13
1 12 14
2 15 17
3 16 18

excluding specific rows in pandas dataframe

I am trying to create a new dataframe selecting only those rows which a specific column value does not start with a capital S.
I have tried the following options:
New_dataframe = dataframe.loc[~dataframe.column.str.startswith(('S'))]
filter = dataframe['column'].astype(str).str.contains(r'^\S')
New_dataframe = dataframe[~filter]
However both options return an empty dataframe. Does anybody have a better solution?
Your code works well:
dataframe = pd.DataFrame({'ColA': ['Start', 'Hello', 'World', 'Stop'],
'ColB': [3, 4, 5, 6]})
New_dataframe = dataframe.loc[~df['ColA'].str.startswith('S')]
print(New_dataframe)
Output:
>>> New_dataframe
ColA ColB
1 Hello 4
2 World 5
>>> dataframe
ColA ColB
0 Start 3
1 Hello 4
2 World 5
3 Stop 6

How to get rows from only some columns based on columns value in Pandas?

I use Pandas to get datas from Excel. From those tables, I often need to find one or some values in only one row, based on value in a column.
I've read a lot about Pandas (doc and SO), and almost everytime, the question is like « how to SELECT * FROM df WHERE value = smthing ».
But what I'd like to do is more like :
SELECT Col1, Col2
FROM df
WHERE Col3.value = smthing
And I can't find any answer.
For example :
>>> dataFrame
foo bar sm_else
0 0 3 6
1 1 4 7
2 2 5 8
I want to get foo value and sm_else value when bar == 4.
So :
foo sm_else
1 7
Result can be DataFrame or can be list or dict, I don't really care.
Thanks !
How can I achieve this ?
df.loc can help you out
import pandas as pd
df = pd.DataFrame(data={'col1': [1, 2, 3],
'col2': [4, 5, 6],
'col3': [7, 8, 9]})
print(df.loc[df['col2'] == 4][['col1', 'col2']])
df.loc[df.bar == 4, ['foo', 'sm_else']]

how to add columns label on a Pandas DataFrame

I can't understand how can I add column names on a pandas dataframe, an easy example will clarify my issue:
dic = {'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1]}
df = pd.DataFrame(dic)
now if I type df than I get
a b c
0 4 4 5
1 1 2 7
2 3 1 9
3 1 4 1
say now that I generate another dataframe just by summing up the columns on the previous one
a = df.sum()
if I type 'a' than I get
a 9
b 11
c 22
That looks like a dataframe without with index and without names on the only column. So I wrote
a.columns = ['column']
or
a.columns = ['index', 'column']
and in both cases Python was happy because he didn't provide me any message of errors. But still if I type 'a' I can't see the columns name anywhere. What's wrong here?
The method DataFrame.sum() does an aggregation and therefore returns a Series, not a DataFrame. And a Series has no columns, only an index. If you want to create a DataFrame out of your sum you can change a = df.sum() by:
a = pandas.DataFrame(df.sum(), columns = ['whatever_name_you_want'])

How to update values in a specific row in a Python Pandas DataFrame?

With the nice indexing methods in Pandas I have no problems extracting data in various ways. On the other hand I am still confused about how to change data in an existing DataFrame.
In the following code I have two DataFrames and my goal is to update values in a specific row in the first df from values of the second df. How can I achieve this?
import pandas as pd
df = pd.DataFrame({'filename' : ['test0.dat', 'test2.dat'],
'm': [12, 13], 'n' : [None, None]})
df2 = pd.DataFrame({'filename' : 'test2.dat', 'n':16}, index=[0])
# this overwrites the first row but we want to update the second
# df.update(df2)
# this does not update anything
df.loc[df.filename == 'test2.dat'].update(df2)
print(df)
gives
filename m n
0 test0.dat 12 None
1 test2.dat 13 None
[2 rows x 3 columns]
but how can I achieve this:
filename m n
0 test0.dat 12 None
1 test2.dat 13 16
[2 rows x 3 columns]
So first of all, pandas updates using the index. When an update command does not update anything, check both left-hand side and right-hand side. If you don't update the indices to follow your identification logic, you can do something along the lines of
>>> df.loc[df.filename == 'test2.dat', 'n'] = df2[df2.filename == 'test2.dat'].loc[0]['n']
>>> df
Out[331]:
filename m n
0 test0.dat 12 None
1 test2.dat 13 16
If you want to do this for the whole table, I suggest a method I believe is superior to the previously mentioned ones: since your identifier is filename, set filename as your index, and then use update() as you wanted to. Both merge and the apply() approach contain unnecessary overhead:
>>> df.set_index('filename', inplace=True)
>>> df2.set_index('filename', inplace=True)
>>> df.update(df2)
>>> df
Out[292]:
m n
filename
test0.dat 12 None
test2.dat 13 16
In SQL, I would have do it in one shot as
update table1 set col1 = new_value where col1 = old_value
but in Python Pandas, we could just do this:
data = [['ram', 10], ['sam', 15], ['tam', 15]]
kids = pd.DataFrame(data, columns = ['Name', 'Age'])
kids
which will generate the following output :
Name Age
0 ram 10
1 sam 15
2 tam 15
now we can run:
kids.loc[kids.Age == 15,'Age'] = 17
kids
which will show the following output
Name Age
0 ram 10
1 sam 17
2 tam 17
which should be equivalent to the following SQL
update kids set age = 17 where age = 15
If you have one large dataframe and only a few update values I would use apply like this:
import pandas as pd
df = pd.DataFrame({'filename' : ['test0.dat', 'test2.dat'],
'm': [12, 13], 'n' : [None, None]})
data = {'filename' : 'test2.dat', 'n':16}
def update_vals(row, data=data):
if row.filename == data['filename']:
row.n = data['n']
return row
df.apply(update_vals, axis=1)
Update null elements with value in the same location in other.
Combines a DataFrame with other DataFrame using func to element-wise combine columns. The row and column indexes of the resulting DataFrame will be the union of the two.
df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
df1.combine_first(df2)
A B
0 1.0 3.0
1 0.0 4.0
more information in this link
There are probably a few ways to do this, but one approach would be to merge the two dataframes together on the filename/m column, then populate the column 'n' from the right dataframe if a match was found. The n_x, n_y in the code refer to the left/right dataframes in the merge.
In[100] : df = pd.merge(df1, df2, how='left', on=['filename','m'])
In[101] : df
Out[101]:
filename m n_x n_y
0 test0.dat 12 None NaN
1 test2.dat 13 None 16
In[102] : df['n'] = df['n_y'].fillna(df['n_x'])
In[103] : df = df.drop(['n_x','n_y'], axis=1)
In[104] : df
Out[104]:
filename m n
0 test0.dat 12 None
1 test2.dat 13 16
If you want to put anything in the iith row, add square brackets:
df.loc[df.iloc[ii].name, 'filename'] = [{'anything': 0}]
I needed to update and add suffix to few rows of the dataframe on conditional basis based on the another column's value of the same dataframe -
df with column Feature and Entity and need to update Entity based on specific feature type
df.loc[df.Feature == 'dnb', 'Entity'] = 'duns_' + df.loc[df.Feature == 'dnb','Entity']

Categories

Resources