python pandas function df pointer doesn't change values - python

I'm trying to pass a function a pointer to an existing df and copy values from one df to another, but after the function finishes, the values are not assigned to the original object.
How to reproduce:
import pandas as pd

def copy(df, new_df):
    new_df = df.copy()
    # an example of things that would be modified on new_df
    new_df[0] = "test"
    # just editing df as an example, to show that df is being changed while new_df is not receiving the values
    df[0] = "test"

if __name__ == '__main__':
    mat = [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ]
    df = pd.DataFrame(mat)
    new_df = pd.DataFrame()
    copy(df, new_df)
    print(new_df)
Notice that I assign "test" to the first column of both frames. The assignment to df does reach the original object outside the function, but new_df does not get the new values.
Is this a bug in pandas, or am I doing something wrong?
Edit:
Assigning values to df[0] is just an example to show that values do change on the original df.
My question is: how do I assign the values from the original df to a new df (it could also be via concat, not only copy) without having to return the df and create a new variable that receives the function's return value?

(Scroll down to Edit section to find the answer)
In your case, new_df = df.copy() only rebinds the local name new_df inside the function. The copy is never returned, so the new_df outside the function still refers to the original empty DataFrame.
Both assignments of "test" happen inside the function: new_df[0] = "test" changes the local copy, which is discarded when the function returns, while df[0] = "test" modifies the caller's df in place. That is why df shows the change and printing new_df outside the function does not.
Try this out:
import pandas as pd

def copy(df, new_df):
    new_df = df.copy()
    new_df[0] = "test"
    return new_df

if __name__ == "__main__":
    mat = [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ]
    df = pd.DataFrame(mat)
    new_df = pd.DataFrame()
    new_df = copy(df, new_df)
    print(new_df)
output
      0  1  2
0  test  2  3
1  test  5  6
2  test  8  9
Edit
Sorry, I completely missed the part where you needed two linked DataFrames. You could just assign df to a new variable without copying it, so that both names refer to the same object.
Try this out:
import pandas as pd

def copy(df):
    new_df = df
    df[0] = "test val"
    return df, new_df

if __name__ == "__main__":
    mat = [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ]
    df = pd.DataFrame(mat)
    df, new_df = copy(df)
    print(df)
    print(new_df)
output:
          0  1  2
0  test val  2  3
1  test val  5  6
2  test val  8  9
          0  1  2
0  test val  2  3
1  test val  5  6
2  test val  8  9
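For the original requirement (filling the caller's new_df without returning anything), one option is to mutate the object that was passed in rather than rebinding the parameter name. The following is only a minimal sketch, assuming new_df starts out empty; the function names rebind and fill_in_place are made up for illustration.

```python
import pandas as pd


def rebind(df, new_df):
    # Rebinds the local name only; the caller's object is untouched.
    new_df = df.copy()
    new_df[0] = "test"


def fill_in_place(df, new_df):
    # Column assignment mutates the object the caller passed in.
    for col in df.columns:
        new_df[col] = df[col]
    new_df[0] = "test"


df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

a = pd.DataFrame()
rebind(df, a)          # a is still empty afterwards

b = pd.DataFrame()
fill_in_place(df, b)   # b now holds df's columns, with column 0 overwritten

print(a.empty)
print(b)
```

The design point is that Python passes the object reference by value: assigning to the parameter name changes nothing for the caller, while calling a mutating operation such as column assignment on the passed object does.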

Related

pandas.DataFrame: aggregate rows based on regex

What I want to do
I'm having trouble cleaning my data because some values were not entered correctly.
import pandas as pd
data = [[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]]
index = ['100: Test', '100: test', '101: FOO', '102: WWW', '101: foo foo']
columns = ['column1', 'column2']
df = pd.DataFrame(data, index=index, columns=columns)
print(df)
## Current output!!!!
#               column1  column2
# 100: Test           1        2
# 100: test           2        4
# 101: FOO            3        6
# 102: WWW            4        8
# 101: foo foo        5       10
## DO SOMETHING!!!!
print(df)
## Expected output!!!!
#            column1  column2
# 100: Test        2        4
# 101: FOO         8       16
# 102: WWW         4        8
My DataFrame.index consists of "ID" + "Name". However, the names are inconsistent, so one ID may show up in more than one row.
Two requests
Sum up rows with the same ID.
Choose one name for the result. (For example, I can use either "Test" or "test" for ID=100.)
What I tried
I tried to use the groupby function, but it doesn't seem to support regex.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
df2 = df.groupby(level=0).sum()
print(df2)
## Output
#               column1  column2
# 100: Test           1        2
# 100: test           2        4
# 101: FOO            3        6
# 101: foo foo        5       10
# 102: WWW            4        8
Environment
Python 3.10.5
Pandas 1.4.3
Your expected output for Test does not reflect that you are trying to do a summation, but from what I can gather this is what you want. groupby can take a function or a mapping or even a series as the by argument. Here, you just want the lowercase version of the index:
df.groupby(df.index.str.lower()).sum()
which gives
           column1  column2
100: test        3        6
101: foo         8       16
102: www         4        8
Here, what I've done is passed it the lowercase index, and it simply groups the rows based on matching elements in the series.
Edit
Based on the updated question, to match the numbers, you can use regular expressions:
df.groupby(df.index.str.extract(r"(\d+):", expand=False)).sum()
which gives
     column1  column2
100        3        6
101        8       16
102        4        8
It isn't clear which name should take precedence, 101: foo foo or 101: FOO, but it seems the numbers are the important part regardless.
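To cover both requests from the question (sum per ID, keep one name per ID), the regex grouping can be combined with a lookup of the first label seen for each ID. This is only a sketch, assuming the first spelling encountered is an acceptable name; the variable names ids, summed and first_label are illustrative.

```python
import pandas as pd

data = [[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]]
index = ['100: Test', '100: test', '101: FOO', '102: WWW', '101: foo foo']
df = pd.DataFrame(data, index=index, columns=['column1', 'column2'])

# Extract the numeric ID in front of the colon from every index label.
ids = df.index.str.extract(r"(\d+):", expand=False)

# Sum all rows that share an ID.
summed = df.groupby(ids).sum()

# For each ID, remember the first full label that appeared in the data.
first_label = pd.Series(df.index, index=ids).groupby(level=0).first()

# Put those labels back as the index of the summed frame.
summed.index = first_label.loc[summed.index].to_numpy()
print(summed)
```

Swapping first() for last() would keep the last spelling instead; which one is "correct" is a data question that the code cannot decide.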
import numpy as np
import pandas as pd
# Data Import
data = [[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]]
index = ['100: Test', '100: test', '101: FOO', '102: WWW', '101: foo']
columns = ['column1', 'column2']
df = pd.DataFrame(data, index=index, columns=columns)
# Data Pre-process
df.reset_index(inplace=True)
df.rename(columns={'index':'ID_Name'},inplace=True)
df['ID'] = df['ID_Name'].str.split(':').str[0]
df.sort_values(['ID','ID_Name'],inplace=True)
df_group = df.groupby(['ID'])[['column1','column2']].sum().reset_index()
df_group
df = pd.merge(df,df_group,how='left',left_on='ID',right_on='ID')
df_final = df.groupby(['ID']).first()
# Data Clean Process
df_final.rename(columns={'column1_y':'column1','column2_y':'column2'},inplace= True)
df_final.drop(['column1_x','column2_x'],axis = 1 , inplace=True)
# Output Display
df_final
Hi Dmjy,
I've posted the code above; please try it on your side,
and let me know if you still have any questions.
Thanks
Leon

pandas combine nested dataframes into one single dataframe

I have a dataframe, where in one column (we'll call it info) all the cells/rows contain another dataframe inside. I want to loop through all the rows in this column and literally stack the nested dataframes on top of each other, because they all have the same columns
How would I go about this?
You could try as follows:
import pandas as pd
length=5
# some dfs
nested_dfs = [pd.DataFrame({'a': [*range(length)],
                            'b': [*range(length)]}) for x in range(length)]
print(nested_dfs[0])
   a  b
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4
# df with nested_dfs in info
df = pd.DataFrame({'info_col': nested_dfs})
# code to be implemented
lst_dfs = df['info_col'].values.tolist()
df_final = pd.concat(lst_dfs,axis=0, ignore_index=True)
df_final.tail()
    a  b
20  0  0
21  1  1
22  2  2
23  3  3
24  4  4
This method should be a bit faster than the solution offered by nandoquintana, which also works.
Incidentally, it is ill-advised to name a df column info, because df.info is actually a method. E.g., normally df['col_name'].values.tolist() can also be written as df.col_name.values.tolist(). However, if you try this with df.info.values.tolist(), you will run into an error:
AttributeError: 'function' object has no attribute 'values'
You also run the risk of overwriting the method if you assign to df.info using attribute syntax, which is probably not what you want. E.g.:
print(type(df.info))
<class 'method'>
df.info=1
# the column is unaffected; you have just replaced the method with an int attribute
print(type(df.info))
<class 'int'>
# but:
df['info']=1
# your column now has all 1's
print(type(df['info']))
<class 'pandas.core.series.Series'>
This is the solution that I came up with. It's not the fastest, which is why I am still leaving the question unanswered:
df1 = pd.DataFrame()
for frame in df['Info'].tolist():
    df1 = pd.concat([df1, frame], axis=0).reset_index(drop=True)
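The reason a loop like this tends to be slower is that every pd.concat call copies all rows gathered so far, whereas a single call over the whole list copies each row once. The sketch below uses made-up data (the frames list and its values are assumptions for illustration) to show that both approaches produce the same frame.

```python
import pandas as pd

# Hypothetical outer frame: an 'Info' column whose cells are DataFrames.
frames = [pd.DataFrame({'a': [i, i + 1], 'b': [i * 10, i * 10 + 1]})
          for i in range(3)]
df = pd.DataFrame({'Info': frames})

parts = df['Info'].tolist()

# Loop version: one concat per nested frame, re-copying the accumulated rows.
looped = parts[0]
for frame in parts[1:]:
    looped = pd.concat([looped, frame], axis=0).reset_index(drop=True)

# Single-call version: hand the whole list to concat once.
stacked = pd.concat(parts, axis=0, ignore_index=True)

print(looped.equals(stacked))
```

The loop here starts from the first nested frame rather than from an empty DataFrame, which is a small deviation from the snippet above made to keep the dtypes identical in both results.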
Our dataframe has three columns (col1, col2 and info).
In info, each row has a nested df as value.
import pandas as pd
nested_d1 = {'coln1': [11, 12], 'coln2': [13, 14]}
nested_df1 = pd.DataFrame(data=nested_d1)
nested_d2 = {'coln1': [15, 16], 'coln2': [17, 18]}
nested_df2 = pd.DataFrame(data=nested_d2)
d = {'col1': [1, 2], 'col2': [3, 4], 'info': [nested_df1, nested_df2]}
df = pd.DataFrame(data=d)
Since the schema of the nested dfs is constant, we can collect them in a list and concatenate them once at the end.
nested_dfs = []
for index, row in df.iterrows():
    nested_dfs.append(row['info'])
result = pd.concat(nested_dfs, sort=False).reset_index(drop=True)
print(result)
This would be the result:
   coln1  coln2
0     11     13
1     12     14
2     15     17
3     16     18

How to render a column name as a single cell in midst of multilevel columns in pandas?

I'm working with multilevel column indexes and need to send these tables, which I render with df.to_html(). The picture below shows where I am now. foo is the index, which I've converted to a column.
When converting it to a column, I want it to occupy both header cells so that it looks nice. This is what I want to achieve.
The code looks like this.
df = pd.DataFrame([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]],index=['M1','M2','M3'])
df.columns = pd.MultiIndex.from_product([['x', 'y'], ['a', 'b']])
ind = df.index
df.reset_index(drop=True,inplace=True)
df.insert(0,'foo',ind)
With the code you provide, foo is not set as the index of the dataframe.
Anyway, you could add this after your current code in order to correct the header of your dataframe before converting it to html:
df = df.rename(axis=1, level=0, mapper={"foo": ""}).rename(
    axis=1, level=1, mapper={"": "foo"}
)
df.to_html(index=False)
This way, the html version of your dataframe renders the desired way:
       x       y
foo  a   b   a   b
M1   1   2   3   4
M2   1   2   3   4
M3   1   2   3   4

How to get rows from only some columns based on columns value in Pandas?

I use Pandas to read data from Excel. From those tables, I often need to find one or more values in a single row, based on the value in another column.
I've read a lot about Pandas (docs and SO), and almost every time, the question is like « how to SELECT * FROM df WHERE value = smthing ».
But what I'd like to do is more like:
SELECT Col1, Col2
FROM df
WHERE Col3.value = smthing
And I can't find any answer.
For example :
>>> dataFrame
   foo  bar  sm_else
0    0    3        6
1    1    4        7
2    2    5        8
I want to get foo value and sm_else value when bar == 4.
So :
foo sm_else
1 7
Result can be DataFrame or can be list or dict, I don't really care.
Thanks !
How can I achieve this ?
df.loc can help you out: pass the row condition first and the list of columns second.
import pandas as pd
df = pd.DataFrame(data={'col1': [1, 2, 3],
                        'col2': [4, 5, 6],
                        'col3': [7, 8, 9]})
print(df.loc[df['col2'] == 4, ['col1', 'col2']])
df.loc[df.bar == 4, ['foo', 'sm_else']]
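Applied to the data from the question, the same pattern might look like the sketch below. The frame is rebuilt by hand here, and the names selected and as_records are illustrative; the to_dict call covers the "list or dict" form the question mentions.

```python
import pandas as pd

# The example frame from the question, rebuilt by hand.
df = pd.DataFrame({'foo': [0, 1, 2], 'bar': [3, 4, 5], 'sm_else': [6, 7, 8]})

# Rows where bar == 4, restricted to the two columns of interest
# (roughly: SELECT foo, sm_else FROM df WHERE bar = 4).
selected = df.loc[df['bar'] == 4, ['foo', 'sm_else']]
print(selected)

# If a plain Python structure is preferred, convert to a list of dicts.
as_records = selected.to_dict('records')
print(as_records)
```

Passing the boolean mask and the column list in a single df.loc call avoids a second indexing step on an intermediate frame.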

how to add columns label on a Pandas DataFrame

I can't understand how to add column names to a pandas dataframe. A simple example will clarify my issue:
dic = {'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1]}
df = pd.DataFrame(dic)
Now if I type df, I get
   a  b  c
0  4  4  5
1  1  2  7
2  3  1  9
3  1  4  1
Say that I now generate another dataframe just by summing up the columns of the previous one:
a = df.sum()
If I type 'a', I get
a     9
b    11
c    22
That looks like a dataframe with an index but without a name on its only column. So I wrote
a.columns = ['column']
or
a.columns = ['index', 'column']
In both cases Python was happy, because it didn't give me any error message. But if I type 'a', I still can't see the column names anywhere. What's wrong here?
The method DataFrame.sum() does an aggregation and therefore returns a Series, not a DataFrame. A Series has no columns, only an index, so assigning a.columns just sets an ordinary attribute that pandas ignores. If you want to create a DataFrame out of your sum, you can replace a = df.sum() with:
a = pd.DataFrame(df.sum(), columns=['whatever_name_you_want'])
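As a hedged alternative sketch, Series.to_frame and Series.reset_index can produce the two shapes the question was attempting with a.columns = ['column'] and a.columns = ['index', 'column']. The variable names totals, as_frame and with_index_col are illustrative.

```python
import pandas as pd

dic = {'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1]}
df = pd.DataFrame(dic)

# df.sum() aggregates each column and returns a Series indexed by column name.
totals = df.sum()
print(type(totals))

# One named column, with the old column names kept as the index.
as_frame = totals.to_frame(name='column')
print(as_frame)

# Two named columns: the old index becomes a regular column called 'index'.
with_index_col = totals.rename_axis('index').reset_index(name='column')
print(with_index_col)
```

Which shape is preferable depends on whether the labels a, b, c should stay as the index or become data in their own right.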
