Pandas: Data Frame Pruning - python

I have a data frame as given below:
data = [['1','tom',1,0],['1','tom',0,1],['2','lynda',0,1],['2','lynda',0,1]]
df = pd.DataFrame(data, columns = ['ID','NAME', 'A','B'])
df.head()
I want to transform the dataframe so that there is one row per ID/NAME pair, where a logical OR is taken over columns A and B. ID and NAME will always have the same pair of values no matter how many times they appear, but columns A and B can vary (00, 10, 11, 01).
So at the end I want ID, NAME, A, B.

You can always sum per group and compare to 0:
import pandas as pd

data = [['1','tom',1,0],['1','tom',0,1],['2','lynda',0,1],['2','lynda',0,1]]
df = pd.DataFrame(data, columns=['ID', 'NAME', 'A', 'B'])
# a group whose sum is > 0 contained at least one 1, i.e. a logical OR
g_df = (df.groupby(['ID', 'NAME']).sum() > 0).astype(float)
g_df.reset_index()
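Equivalently, since A and B only ever hold 0 or 1, the group-wise max is the logical OR directly and keeps the columns as integers. A minimal sketch on the sample data above:

```python
import pandas as pd

data = [['1', 'tom', 1, 0], ['1', 'tom', 0, 1],
        ['2', 'lynda', 0, 1], ['2', 'lynda', 0, 1]]
df = pd.DataFrame(data, columns=['ID', 'NAME', 'A', 'B'])

# max over 0/1 values per group == logical OR
g_df = df.groupby(['ID', 'NAME'], as_index=False).max()
print(g_df)
```

With `as_index=False` the grouping keys stay as regular columns, so no `reset_index()` is needed afterwards.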

Related

How to sort a pandas dataframe based on numbers in column names

I have a data file with column names like this (numbers in the name from 1 to 32):
inlet_left_cell-10<stl-unit=m>-imprint)
inlet_left_cell-11<stl-unit=m>-imprint)
inlet_left_cell-12<stl-unit=m>-imprint)
-------
inlet_left_cell-9<stl-unit=m>-imprint)
(data rows …)
I would like to sort the columns (with their data) from left to right in python based on the number in the column name. I need to move each whole column according to the number in its name.
So xxx-1xxx, xxx-2xxx, xxx-3xxx, ..., xxx-32xxx:
inlet_left_cell-1<stl-unit=m>-imprint)
inlet_left_cell-2<stl-unit=m>-imprint)
inlet_left_cell-3<stl-unit=m>-imprint)
-------
inlet_left_cell-32<stl-unit=m>-imprint)
(data rows …)
Is there any way to do this in python ? Thanks.
Here is a solution:
import numpy as np
import pandas as pd

# Some random data
data = np.random.randint(1, 10, size=(100, 32))
# Set up column names as given in the question, randomly ordered
columns = [f'inlet_left_cell-{x}<stl-unit=m>-imprint)' for x in range(1, 33)]
np.random.shuffle(columns)
# Create the dataframe
df = pd.DataFrame(data, columns=columns)
df.head()
# Sort the columns into the required order
col_nums = [int(x.split('-')[1].split('<')[0]) for x in df.columns]
column_map = dict(zip(col_nums, df.columns))
df = df[[column_map[i] for i in range(1, 33)]]
df.head()
There are many ways to do it; I'm just posting a simple one.
Simply extract the column names and sort them using natsort.
Assuming the DataFrame is named df:
from natsort import natsorted, ns
dfl = list(df)  # column names as a list
dfl = natsorted(dfl, alg=ns.IGNORECASE)  # sort based on the embedded numbers
df_sorted = df[dfl]  # rearrange the DataFrame
print(df_sorted)
If the column names differ only by this number, and the numbers all have the same digit count (plain sorted() compares strings lexicographically, so cell-9 would otherwise land after cell-32), try this:
import pandas as pd
data = pd.read_excel("D:\\..\\file_name.xlsx")
data = data.reindex(sorted(data.columns), axis=1)
For example:
data = pd.DataFrame(columns=["inlet_left_cell-23<stl-unit=m>-imprint)", "inlet_left_cell-47<stl-unit=m>-imprint)", "inlet_left_cell-10<stl-unit=m>-imprint)", "inlet_left_cell-12<stl-unit=m>-imprint)"])
print(data)
inlet_left_cell-23<stl-unit=m>-imprint) inlet_left_cell-47<stl-unit=m>-imprint) inlet_left_cell-10<stl-unit=m>-imprint) inlet_left_cell-12<stl-unit=m>-imprint)
After this:
data = data.reindex(sorted(data.columns), axis=1)
print(data)
inlet_left_cell-10<stl-unit=m>-imprint) inlet_left_cell-12<stl-unit=m>-imprint) inlet_left_cell-23<stl-unit=m>-imprint) inlet_left_cell-47<stl-unit=m>-imprint)
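When the digit counts are mixed (cell-9 next to cell-10), a numeric sort key avoids the lexicographic pitfall without any extra dependency. A minimal sketch; the regex pattern is an assumption based on the names shown in the question:

```python
import re
import pandas as pd

cols = ['inlet_left_cell-23<stl-unit=m>-imprint)',
        'inlet_left_cell-9<stl-unit=m>-imprint)',
        'inlet_left_cell-10<stl-unit=m>-imprint)']
df = pd.DataFrame([[1, 2, 3]], columns=cols)

# pull out the first run of digits after the hyphen and compare as integers
def num_key(name):
    return int(re.search(r'-(\d+)<', name).group(1))

df = df[sorted(df.columns, key=num_key)]
print(list(df.columns))  # 9 first, then 10, then 23
```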

Concatenate two dataframes with different row indices

I want to concatenate two data frames of the same length, by adding a column to the first one (df).
But because certain df rows are being filtered, it seems the index isn't matching.
import io
import pandas as pd

df = pd.read_csv(io.StringIO(uploaded['customer.csv'].decode('utf-8')), sep=";")
df["Margin"] = df["Sales"] - df["Cost"]
df = df.loc[df["Margin"] > -100000]
df = df.loc[df["Sales"] > 1000]
df.reindex()
df
This returns:
So this operation:
customerCluster = pd.concat([df, clusters], axis = 1, ignore_index= True)
print(customerCluster)
Is returning:
So I've tried reindex() and the ignore_index=True argument, as you can see in the code snippets above, but neither made the indices line up.
Thanks for all the answers. If anyone runs into the same problem, this is the solution I found:
customerID = df["CustomerID"]
customerID = customerID.reset_index(drop=True)
df = df.reset_index(drop=True)
So, basically, the indexes of both data frames now match, thus:
customerCluster = pd.concat((customerID, clusters), axis = 1)
This will concatenate the two data frames correctly.
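A minimal end-to-end sketch of that fix, using small stand-in frames (the values and the clusters frame are made up for illustration; clusters is assumed to have a default RangeIndex):

```python
import pandas as pd

df = pd.DataFrame({'CustomerID': [1, 2, 3, 4],
                   'Sales': [500, 2000, 3000, 4000]})
df = df.loc[df['Sales'] > 1000]        # filtering leaves gaps in the index
clusters = pd.DataFrame({'Cluster': [0, 1, 0]})

customerID = df['CustomerID'].reset_index(drop=True)  # back to 0..n-1
customerCluster = pd.concat((customerID, clusters), axis=1)
print(customerCluster)
```

Without the `reset_index(drop=True)`, concat aligns on the gapped index and produces NaN rows; with it, the rows pair up positionally.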

How can I drop rows with certain values from a dataframe?

I'm merging two different datasets into a single data frame, and from the resulting frame I need to remove the rows whose 'Presunto Responsable' column holds the value 'Desconocido'.
This is my code so far:
#%% Get data
def getData(path_A, path_B):
    victims = pd.read_excel(path_A)
    dfv = pd.DataFrame(data=victims)
    cases = pd.read_excel(path_B)
    dfc = pd.DataFrame(data=cases)
    return dfv, dfc

#%% merge dataframes
def mergeData(data_A, data_B):
    data = pd.DataFrame()
    # merge dataframes avoiding duplicated columns
    cols_to_use = data_B.columns.difference(data_A.columns)
    data = pd.merge(data_A, data_B[cols_to_use], left_index=True, right_index=True, how='outer')
    cols_at_end = ['Presunto Responsable']
    # move 'Presunto Responsable' to the end of the dataframe
    data = data[[c for c in data if c not in cols_at_end]
                + [c for c in cols_at_end if c in data]]
    return data

#%% Drop 'Desconocido' values in 'Presunto Responsable'
def dropData(data):
    indexNames = data[data['Presunto Responsable'] == 'Desconocido'].index
    for c in indexNames:
        data.drop(indexNames, inplace=True)
    return data
The resulting dataframe still has the rows with 'Desconocido' values in them. What am I doing wrong?
You can just write:
data = data[data['Presunto Responsable'] != 'Desconocido']
Also, by the way, pd.read_excel() already returns a DataFrame, so you don't need to pass its result into pd.DataFrame().
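A runnable sketch of that boolean filter on a toy frame (the column values here are stand-ins):

```python
import pandas as pd

data = pd.DataFrame({'Presunto Responsable': ['Desconocido', 'Otro', 'Desconocido'],
                     'Casos': [1, 2, 3]})

# keep only the rows whose value is NOT 'Desconocido'
data = data[data['Presunto Responsable'] != 'Desconocido']
print(data)
```

Note that the filtered result must be assigned back (or to a new name); the comparison itself does not modify the frame in place.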

How to merge columns interspersing the data?

I'm new to python and pandas. I'm working to create a pandas MultiIndex from two independent variables, flow and head, which define a dataframe, and I have 27 different design points. The data is currently organized in a single dataframe with a column for each variable and a row for each design point.
Here's how I created the MultiIndex:
flow = df.loc[0, ["Mass_Flow_Rate", "Mass_Flow_Rate.1", "Mass_Flow_Rate.2"]]
dp = df.loc[:, "Design Point"]
index = pd.MultiIndex.from_product([dp, flow], names=['DP', 'Flows'])
I then created three columns of data:
df0 = df.loc[:,"Head2D"]
df1 = df.loc[:,"Head2D.1"]
df2 = df.loc[:,"Head2D.1"]
And want to merge these into a single column of data such that I can use this command:
pc = pd.DataFrame(data, index=index)
Using the three columns with the same row indexes (0-27), I want to merge them into a single column with the data interspersed. Calling the columns col1, col2 and col3, and denoting the row index in parentheses (so col1(0) is column 1, index 0), I want the data to look like:
col1(0)
col2(0)
col3(0)
col1(1)
col2(1)
col3(1)
col1(2)...
It is a bit confusing, but what I understood is that you are trying to do this:
flow = df.loc[0, ["Mass_Flow_Rate", "Mass_Flow_Rate.1", "Mass_Flow_Rate.2"]]
dp = df.loc[:, "Design Point"]
index = pd.MultiIndex.from_product([dp, flow], names=['DP', 'Flows'])
df0 = df.loc[:, "Head2D"]
df1 = df.loc[:, "Head2D.1"]
df2 = df.loc[:, "Head2D.1"]
data = pd.concat([df0, df1, df2])
pc = pd.DataFrame(data=data, index=index)
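Note that a plain concat stacks the three columns one after another (all of col1, then all of col2, then all of col3). For the interleaved order the question asks for (col1(0), col2(0), col3(0), col1(1), ...), stack() walks each row left to right and gives exactly that. A sketch on toy data; the values are invented and 'Head2D.2' is assumed as the third column name:

```python
import pandas as pd

df = pd.DataFrame({'Head2D': [10, 11, 12],
                   'Head2D.1': [20, 21, 22],
                   'Head2D.2': [30, 31, 32]})

# stack() emits row 0 across all columns, then row 1, and so on
data = df[['Head2D', 'Head2D.1', 'Head2D.2']].stack().reset_index(drop=True)
print(data.tolist())  # [10, 20, 30, 11, 21, 31, 12, 22, 32]
```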

change specific columns of data frame using pandas

I have a data frame with 3 columns: ["date", "volume", "ID"], where ID = 0, 1, 2, ..., 15.
I would like to create a new data frame that keeps all rows with ID = 5 unchanged.
Keep the other rows too, but set their row["volume"] = 0.
First copy your dataframe:
df_new = df.copy()
Then, using pd.DataFrame.loc, set volume to 0 for your criteria:
df_new.loc[df_new['ID'] != 5, 'volume'] = 0
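Put together on a toy frame (the dates, volumes and IDs are made up):

```python
import pandas as pd

df = pd.DataFrame({'date': ['d1', 'd2', 'd3'],
                   'volume': [100, 200, 300],
                   'ID': [5, 7, 5]})

df_new = df.copy()                          # leave the original untouched
df_new.loc[df_new['ID'] != 5, 'volume'] = 0  # zero volume for every other ID
print(df_new)
```

The copy matters: without it, the `.loc` assignment would write into the original df as well.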
