subset the dataframe into a new one using copy [duplicate] - python

This question already has answers here:
why should I make a copy of a data frame in pandas
(8 answers)
Closed 4 years ago.
I have a dataframe df
a b c
0 5 6 9
1 6 7 10
2 7 8 11
3 8 9 12
So if I want to select only col a and b and store it in another df I would use something like this
df1 = df[['a','b']]
But I have seen places where people write it this way
df1 = df[['a','b']].copy()
Can anyone let me know what is .copy() because the earlier code works just fine.

For example, if you want to rename a dataframe (example using replace):
df2=df
df2=df2.replace('blah','foo')
Here:
df==df2
Will be:
True
You want it to only do to, df2:
df2=df.copy()
df2=df2.replace('blah','foo')
Then now:
df==df2
Returns:
False

Related

Pandas Drop Duplicates And Store Duplicates [duplicate]

This question already has answers here:
How do I get a list of all the duplicate items using pandas in python?
(13 answers)
Closed 2 months ago.
i use the pandas.DataFrame.drop_duplicates to search duplicates in a dataframe. This removes the duplicates from the dataframe. This also works great. However, I would like to know which data has been removed.
Is there a way to save the data in a new list before removing it?
I have unfortunately found in the documentation of pandas no information on this.
Thanks for the answer.
It uses duplicated function to filter out the information which is duplicated. By default the first occurrence is set to True, all others set as False, Using this function and filter on original data, you can know which data is kept and which is dropped out.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
You can use duplicated and boolean indexing with groupby.agg to keep the list of duplicates:
m = df.duplicated('group')
dropped = df[m].groupby(df['group'])['value'].agg(list)
print(dropped)
df = df[~m]
print(df)
Output:
# print(dropped)
group
A [2]
B [4, 5]
C [7]
Name: value, dtype: object
# print(df)
group value
0 A 1
2 B 3
5 C 6
Used input:
group value
0 A 1
1 A 2
2 B 3
3 B 4
4 B 5
5 C 6
6 C 7

Combining pandas rows based on condition [duplicate]

This question already has answers here:
Pandas groupby with delimiter join
(2 answers)
Concatenate strings from several rows using Pandas groupby
(8 answers)
Closed 3 years ago.
Given a Pandas Dataframe df, with column names 'Session', and 'List':
Can I group together the 'List' values for the same values of 'Session'?
My Approach
I've tried solving the problem by creating a new dataframe, and iterating through the rows of the inital dataframe while maintaing a session counter that I increment if I see that the session has changed.
If it hasn't changed, then I append the List value that corresponds to that rows value with a comma.
Whenever the session changes, I used strip to get rid of the last comma (extra).
Initial DataFrame
Session List
0 1 a
1 1 b
2 1 c
3 2 d
4 2 e
5 3 f
Required DataFrame
Session List
0 1 a,b,c
1 2 d,e
2 3 f
Can someone suggest something more efficient or simple?
Thank you in advance.
Use groupby and apply and reset_index:
>>> df.groupby('Session')['List'].agg(','.join).reset_index()
Session List
0 1 a,b,c
1 2 d,e
2 3 f
>>>

How do I clean phone numbers in pandas [duplicate]

This question already has answers here:
How to only do string manupilation on column of pandas that have 4 digits or less?
(3 answers)
Closed 3 years ago.
I have a pandas dataframe with a column for Phone however, the data is a bit inconsistent. Here are some examples that I would like to focus on.
df["Phone"]
0 732009852
1 738073222
2 755920306
3 0755353288
Row 3 has the necessary leading 0 for an Australian number. How do I update rows like 0,1 and 2?
Use pandas.Series.str.zfill:
s = pd.Series(['732009852', '0755353288'])
s.str.zfill(10)
Output:
0 0732009852
1 0755353288
Or pd.Series.str.rjust:
print(df["Phone"].str.rjust(10, '0'))
Output:
0 0732009852
1 0738073222
2 0755920306
3 0755353288

How to delete many columns in python with one line of code? [duplicate]

This question already has answers here:
Deleting multiple columns based on column names in Pandas
(11 answers)
Closed 3 years ago.
I am trying to delete the following columns on my dataframe: 1,2,101:117,121:124,126.
So far the two ways I have found to delete columns is:
df.drop(df.columns[2:6],axis=1)
df.drop(df.columns[[0,3,5]],axis=1)
however if I try
df.drop(df.columns[1,2,101:117,121:124],axis=1)
I get a "too many indices" error
I also tried this
a=df.drop(df.columns[[1,2]],axis=1)
b=a.drop(a.columns[99:115],axis=1)
c=b.drop(b.columns[102:105],axis=1)
d=c.drop(c.columns[103],axis=1)
but this isn't deleting the columns I'm wanting to for some reason.
Use np.r_ to slice:
import numpy as np
df.drop(columns=df.columns[np.r_[1, 2, 101:117, 121:124, 126]])
import pandas pd
df = pd.DataFrame(np.random.randint(1, 10, (2, 130)))
df.drop(columns=df.columns[np.r_[1, 2, 101:117, 121:124, 126]])
# 0 3 4 5 6 ... 120 124 125 127
#0 6 1 3 7 2 ... 8 7 2 6
#1 1 9 2 5 3 ... 7 3 9 4
This should work:
df.drop(df.columns[[indexes_of_columns_you_want_to_delete]], axis=1,
inplace=True)
Please try this:
import numpy as np
import pandas as pd
input_df.drop(input_df.columns[[np.r_[0,2:4]]],axis=1, inplace = True)

Error subsetting a data frame in python [duplicate]

This question already has an answer here:
Python - splitting dataframe into multiple dataframes based on column values and naming them with those values [duplicate]
(1 answer)
Closed 4 years ago.
I am learning python and pandas and am having trouble overcoming an error while trying to subset a data frame.
I have an input data frame:
df0-
Index Group Value
1 A 10
2 A 15
3 B 20
4 C 10
5 C 10
df0.dtypes-
Group object
Value float64
That I am trying to split out into unique values based off of the Group column. With the output looking something like this:
df1-
Index Group Value
1 A 10
2 A 15
df2-
Index Group Value
3 B 20
df3-
Index Group Value
4 C 10
5 C 10
So far I have written this code to subset the input:
UniqueGroups = df0['Group'].unique().tolist()
OutputFrame = {}
for x in UniqueAgencies:
ReturnFrame[str('ConsolidateReport_')+x] = UniqueAgencies[df0['Group']==x]
The code above returns the following error, which I can`t quite work my head around. Can anyone point me in the right direction?
*** TypeError: list indices must be integers or slices, not str
you can use groupby to group the column
for _, g in df0.groupby('Group'):
print g

Categories

Resources