How can I compare the column names of two different pandas DataFrames? I want to compare train and test DataFrames, where some columns present in train are missing from test.
pandas.Index objects, including dataframe columns, have useful set-like methods, such as intersection and difference.
For example, given dataframes train and test:
train_cols = train.columns
test_cols = test.columns
common_cols = train_cols.intersection(test_cols)    # columns present in both
train_not_test = train_cols.difference(test_cols)   # columns only in train
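As a quick illustration (a minimal sketch with made-up column names), the same methods on two toy DataFrames:
import pandas as pd
train = pd.DataFrame(columns=['id', 'feature_a', 'feature_b', 'target'])
test = pd.DataFrame(columns=['id', 'feature_a', 'feature_b'])
print(list(train.columns.intersection(test.columns)))  # ['id', 'feature_a', 'feature_b']
print(list(train.columns.difference(test.columns)))    # ['target']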
Related
I have a data file with column names like this (the numbers in the names run from 1 to 32):
inlet_left_cell-10<stl-unit=m>-imprint)
inlet_left_cell-11<stl-unit=m>-imprint)
inlet_left_cell-12<stl-unit=m>-imprint)
-------
inlet_left_cell-9<stl-unit=m>-imprint)
data
data
...
I would like to sort the columns (together with their data) from left to right in Python, based on the number in the column name, so that each whole column moves as a unit.
So: xxx-1xxx, xxx-2xxx, xxx-3xxx, ..., xxx-32xxx
inlet_left_cell-1<stl-unit=m>-imprint)
inlet_left_cell-2<stl-unit=m>-imprint)
inlet_left_cell-3<stl-unit=m>-imprint)
-------
inlet_left_cell-32<stl-unit=m>-imprint)
data
data
...
Is there any way to do this in Python? Thanks.
Here is one solution:
import numpy as np
import pandas as pd
# Some random data
data = np.random.randint(1, 10, size=(100, 32))
# Set up column names as given in the problem, randomly ordered
columns = [f'inlet_left_cell-{x}<stl-unit=m>-imprint)' for x in range(1, 33)]
np.random.shuffle(columns)
# Create the dataframe
df = pd.DataFrame(data, columns=columns)
df.head()
# Sort the columns into the required order: extract the number
# between the first '-' and '<' in each name, then map it back
col_nums = [int(x.split('-')[1].split('<')[0]) for x in df.columns]
column_map = dict(zip(col_nums, df.columns))
df = df[[column_map[i] for i in range(1, 33)]]
df.head()
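If you prefer not to build the mapping by hand, the same numeric key can drive sorted() directly (a minimal sketch, assuming the same name pattern as above):
df = df[sorted(df.columns, key=lambda c: int(c.split('-')[1].split('<')[0]))]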
There are many ways to do it; I'm just posting a simple one.
Simply extract the column names and sort them with natsort.
Assuming the DataFrame is named df:
from natsort import natsorted, ns
dfl = list(df)                           # column names as a list
dfl = natsorted(dfl, alg=ns.IGNORECASE)  # natural sort on the embedded numbers
df_sorted = df[dfl]                      # rearrange the DataFrame
print(df_sorted)
If the column names differ only by this number, you can sort them directly. Note that sorted() compares strings lexicographically, so this only matches numeric order when the numbers all have the same width (as in the two-digit example below):
import pandas as pd
data = pd.read_excel("D:\\..\\file_name.xlsx")
data = data.reindex(sorted(data.columns), axis=1)
For example:
data = pd.DataFrame(columns=["inlet_left_cell-23<stl-unit=m>-imprint)", "inlet_left_cell-47<stl-unit=m>-imprint)", "inlet_left_cell-10<stl-unit=m>-imprint)", "inlet_left_cell-12<stl-unit=m>-imprint)"])
print(data)
inlet_left_cell-23<stl-unit=m>-imprint) inlet_left_cell-47<stl-unit=m>-imprint) inlet_left_cell-10<stl-unit=m>-imprint) inlet_left_cell-12<stl-unit=m>-imprint)
After this:
data = data.reindex(sorted(data.columns), axis=1)
print(data)
inlet_left_cell-10<stl-unit=m>-imprint) inlet_left_cell-12<stl-unit=m>-imprint) inlet_left_cell-23<stl-unit=m>-imprint) inlet_left_cell-47<stl-unit=m>-imprint)
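When the numbers have mixed widths (1 through 9 alongside 10 through 32), a plain lexicographic sort would put cell-9 after cell-32, so a numeric sort key is needed. A sketch, assuming the same name pattern as in the question:
data = data.reindex(sorted(data.columns, key=lambda c: int(c.split('-')[1].split('<')[0])), axis=1)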
I want to concatenate two DataFrames of the same length by adding a column to the first one (df). But because certain rows of df are being filtered out, it seems the indexes aren't matching.
import io
import pandas as pd
df = pd.read_csv(io.StringIO(uploaded['customer.csv'].decode('utf-8')), sep=";")
df["Margin"] = df["Sales"] - df["Cost"]
df = df.loc[df["Margin"] > -100000]
df = df.loc[df["Sales"] > 1000]
df.reindex()  # note: without arguments and without re-assignment this has no effect
df
This returns the filtered DataFrame (output screenshot omitted). So this operation:
customerCluster = pd.concat([df, clusters], axis = 1, ignore_index= True)
print(customerCluster)
is returning a misaligned result (output screenshot omitted). I've tried reindex and the argument ignore_index=True, as you can see in the code snippet above.
Thanks for all the answers. If anyone encounters the same problem, the solution I found was this:
customerID = df["CustomerID"]
customerID = customerID.reset_index(drop=True)
df = df.reset_index(drop=True)
So, basically, the indexes of both data frames are now matching, thus:
customerCluster = pd.concat((customerID, clusters), axis = 1)
This will correctly concatenate the two DataFrames.
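To see why the reset matters, here is a minimal sketch (toy data, hypothetical values): after filtering, df keeps its original index labels, and pd.concat aligns on those labels, producing NaNs wherever they don't match:
import pandas as pd
df = pd.DataFrame({'Sales': [500, 2000, 3000]})   # index 0, 1, 2
df = df.loc[df['Sales'] > 1000]                   # keeps labels 1 and 2
clusters = pd.Series([0, 1], name='Cluster')      # index 0, 1
print(pd.concat([df, clusters], axis=1))          # misaligned: NaNs appear
print(pd.concat([df.reset_index(drop=True), clusters], axis=1))  # aligned, no NaNs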
I have a pandas DataFrame with 7000 rows and 7 columns, and a list (row_list) containing the values I want to filter on. What I want is to filter df down to the rows whose value in a given column appears in the list. This is what I got when I tried:
"Empty DataFrame
Columns: [A,B,C,D,E,F,G]
Index: []"
df = pd.read_csv('filename.csv')
df1 = pd.read_csv('filename1.csv', names=['A'])
row_list = []
for index, rows in df1.iterrows():
    my_list = [rows.A]
    row_list.append(my_list)
boolean_series = df.D.isin(row_list)
filtered_df = df[boolean_series]
print(filtered_df)
Replace
boolean_series = df.D.isin(row_list)
with
boolean_series = df.D.isin(df1.A)
Your row_list is a list of single-element lists ([rows.A]), so isin never matches the plain values in the column; passing the Series df1.A compares the actual values. Let us know the result, and if it doesn't work, show a sample of df and df1.A.
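A minimal sketch of the suggested fix (toy values, hypothetical column names):
import pandas as pd
df = pd.DataFrame({'D': ['a', 'b', 'c']})
df1 = pd.DataFrame({'A': ['a', 'c']})
# isin expects a flat sequence of values, so pass the Series directly
# instead of building a list of single-element lists in a loop
print(df[df.D.isin(df1.A)])   # keeps the rows where D is 'a' or 'c'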
A few other options:
(1) generate separate DataFrames for each condition, concat them, then de-duplicate (slow);
(2) use a custom function to annotate a boolean column (default False, set True when the condition is fulfilled), then filter on that column, as sketched below;
(3) keep a list of the indices of all rows matching your row_list values, then filter with iloc based on that list.
Without an MRE, sample data, or a reason why your method didn't work, it's difficult to provide a more specific answer.
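A rough sketch of option (2), with made-up column names and values:
import pandas as pd
df = pd.DataFrame({'D': ['a', 'b', 'c', 'a']})
row_list = ['a', 'c']
df['keep'] = False                              # default annotation
df.loc[df['D'].isin(row_list), 'keep'] = True   # mark rows fulfilling the condition
filtered_df = df[df['keep']].drop(columns='keep')
print(filtered_df)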
I currently have a CSV that contains many rows (some 200k) with many columns each. I basically want a time-series train/test split. There are many unique items inside my dataset, and I want the first 80% (chronologically) of each item's rows to be in the training data. I wrote the following code to do so:
import pandas as pd
df = pd.read_csv('Data.csv')
df['Date'] = pd.to_datetime(df['Date'])
test = pd.DataFrame()
train = pd.DataFrame()
itemids = df.itemid.unique()
for i in itemids:
    df2 = df.loc[df['itemid'] == i]
    df2 = df2.sort_values(by='Date', ascending=True)
    trainvals = df2[:int(len(df2)*0.8)]
    testvals = df2[int(len(df2)*0.8):]
    train.append(trainvals)
    test.append(testvals)
It seems like trainvals and testvals are being populated properly, but they are not being added to test and train. Am I adding them wrong?
Your immediate issue is not re-assigning inside the for-loop: DataFrame.append returns a new object rather than modifying in place (and it has been deprecated in recent pandas versions), so you would need:
train = train.append(trainvals)
test = test.append(testvals)
However, growing a DataFrame inside a loop is memory-inefficient. Instead, consider iterating over groupby to build a list of dictionaries containing the train and test splits via a list comprehension, then call pd.concat once to bind each set together. A small helper function keeps the processing organized:
def split_dfs(df):
    df = df.sort_values(by='Date')
    trainvals = df[:int(len(df)*0.8)]
    testvals = df[int(len(df)*0.8):]
    return {'train': trainvals, 'test': testvals}

dfs = [split_dfs(g_df) for g, g_df in df.groupby('itemid')]
train_df = pd.concat([x['train'] for x in dfs])
test_df = pd.concat([x['test'] for x in dfs])
You can also avoid the loop with a per-group percentile rank. Note that df.groupby('itemid').quantile(0.8) returns one aggregated row per item rather than a row split, so rank the dates within each item instead:
pct = df.groupby('itemid')['Date'].rank(pct=True, method='first')
train = df[pct <= 0.8]
test = df[pct > 0.8]
rank(pct=True, method='first') gives each row its chronological position within its item as a fraction of the group size, so the earliest 80% of each item's rows land in train.
I am trying to filter pandas dataframe columns (with type pandas.core.index.Index) by a partial label.
I am searching for a builtin method that achieves the same result as:
partial_label = 'partial_lab'
columns = df.columns
columns = [c for c in columns if c.startswith(partial_label)]
df = df[columns]
Is there anything built-in to obtain this?
Thanks
Possible solutions:
df.filter(regex='^partial_lab')   # anchored so only names starting with the label match
or
idx = df.columns.str.startswith('partial_lab')
df.loc[:, idx]
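A quick check of the behavior (a minimal sketch with toy column names):
import pandas as pd
df = pd.DataFrame(columns=['partial_lab_1', 'partial_lab_2', 'other', 'not_partial_lab'])
print(list(df.filter(regex='^partial_lab').columns))
# ['partial_lab_1', 'partial_lab_2'] -- the anchor excludes 'not_partial_lab'
print(list(df.loc[:, df.columns.str.startswith('partial_lab')].columns))
# same result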