Get a subset of data from one row of Dataframe

Let's say I have a dataframe df with columns 'A', 'B', and 'C'.
Now I want to extract only row 2 of df, and only columns 'B' and 'C'. What is the most efficient way to do that?
Can you please tell me why df.ix[2, ['B', 'C']] didn't work?
Thank you!

Consider the dataframe df:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(9).reshape(3, 3), list('xyz'), list('ABC'))
df
   A  B  C
x  0  1  2
y  3  4  5
z  6  7  8
If you want to keep a dataframe, pass the row label as a list:
df.loc[df.index[[1]], ['B', 'C']]
   B  C
y  4  5
If you want a series, pass a single row label:
df.loc[df.index[1], ['B', 'C']]
B    4
C    5
Name: y, dtype: int64
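As a minimal sketch of mixing a positional row with labeled columns (the kind of lookup df.ix[2, ['B', 'C']] was often used for): the .ix indexer was deprecated and then removed in pandas 1.0, so on current versions that call fails regardless of the data. The example assumes the same df as above and that "row 2" means the second row (position 1). The names row_pos and cols are introduced only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), list('xyz'), list('ABC'))

row_pos = 1        # hypothetical name: position of the second row
cols = ['B', 'C']  # hypothetical name: column labels to keep

# Translate the column labels to positions so the whole lookup is positional
as_series = df.iloc[row_pos, df.columns.get_indexer(cols)]

# Or translate the row position to a label so the whole lookup is label-based
same_series = df.loc[df.index[row_pos], cols]

print(as_series)
```

Both forms return the same series here; which one reads better mostly depends on whether the rest of the code thinks in positions or in labels.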

row_2 = df[['B', 'C']].iloc[1]
or
# Convert each row to a (B, C) tuple, then take the second row
row_2 = list(df[['B', 'C']].apply(tuple, axis=1))[1]

Related

Hierarchical Columns in Numpy

I'm new to Pandas and trying to recreate the following dataframe, such that the values in columns A and B are random numbers from 0 through 8. However, I keep getting "ValueError: all arrays must be same length". Can someone please review my code? Thank you!
df = pd.DataFrame(np.random.randint(0, high=9),index = [[1, 2, 3], ['a', 'b']],
columns = ['A', 'B'])

Since the index has two levels, you have to create a MultiIndex:
df = pd.DataFrame(
    np.random.randint(9, size=(6, 2)),
    index=pd.MultiIndex.from_product([[1, 2, 3], ['a', 'b']]),
    columns=['A', 'B']
)
Output (the values are random and will differ between runs):
     A  B
1 a  1  0
  b  4  3
2 a  7  3
  b  1  6
3 a  5  4
  b  3  3
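The following sketch illustrates why the data needs six rows and how the two index levels can be used afterwards. It assumes NumPy's Generator API; the seed and the name rng are only there to make reruns repeatable and are not part of the original answer.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[1, 2, 3], ['a', 'b']])

# One row of random data per index entry, one column per column label
rng = np.random.default_rng(0)  # seeded only so reruns give the same numbers
df = pd.DataFrame(
    rng.integers(0, 9, size=(len(index), 2)),  # integers in 0..8
    index=index,
    columns=['A', 'B'],
)

print(len(index))        # 3 outer labels x 2 inner labels
print(df.loc[(2, 'b')])  # select a single row by both index levels
print(df.loc[2])         # select all rows under outer label 2
```

Sizing the data from len(index) keeps the array and the index in step if the label lists change.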

How to concatenate a pandas column by a partition?

I have a pandas data frame like this:
df = pd.DataFrame({"Id": [1, 1, 1, 2, 2, 2, 2],
                   "Letter": ['A', 'B', 'C', 'A', 'D', 'B', 'C']})
How can I efficiently add a new column, "Merge", that concatenates all the values from the column "Letter" by "Id", so that the final data frame would look like this:

You can group by the Id column and then transform the Letter column:
df['Merge'] = df.groupby('Id')['Letter'].transform(lambda x: '-'.join(x))
print(df)
   Id Letter    Merge
0   1      A    A-B-C
1   1      B    A-B-C
2   1      C    A-B-C
3   2      A  A-D-B-C
4   2      D  A-D-B-C
5   2      B  A-D-B-C
6   2      C  A-D-B-C
Thanks to sammywemmy for pointing out that the lambda is unnecessary here, so you can use a simpler form:
df['Merge'] = df.groupby('Id')['Letter'].transform('-'.join)
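As an alternative sketch (not from the original answer), the joined string can be computed once per group and then mapped back onto the rows. The name merged is introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Id": [1, 1, 1, 2, 2, 2, 2],
                   "Letter": ['A', 'B', 'C', 'A', 'D', 'B', 'C']})

# Build one joined string per Id ...
merged = df.groupby('Id')['Letter'].agg('-'.join)

# ... then broadcast it back to every row with the same Id
df['Merge'] = df['Id'].map(merged)

print(df)
```

This keeps the per-group result around as its own series, which can be handy if it is needed elsewhere.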

Filter Columns from Pandas Dataframe with given list when list elements may or may not be present as column

I have a huge dataframe, and I need to keep only those of its columns that appear in a given list.
For example,
df = pd.DataFrame([[1,2,3,4,5],[6,7,8,9,10]], columns=list('ABCDE'))
This is the dataframe:
   A  B  C  D   E
0  1  2  3  4   5
1  6  7  8  9  10
I have a list:
fil_lst = ['A', 'D', 'F']
The list may contain column names that are not present in the dataframe. I need only the columns that are present in the dataframe.
I need the resulting dataframe to look like this:
   A  D
0  1  4
1  6  9
I know it can be done with a list comprehension like this:
new_df = df[[col for col in fil_lst if col in df.columns]]
But since the dataframe is huge, I would prefer to avoid this computationally expensive process.
Is it possible to vectorize this in any way?

Use Index.isin to test membership in the columns and DataFrame.loc to filter by columns; the : selects all rows, and the boolean mask selects the columns:
fil_lst = ['A', 'D', 'F']
df = df.loc[:, df.columns.isin(fil_lst)]
print(df)
   A  D
0  1  4
1  6  9
Or use Index.intersection:
fil_lst = ['A', 'D', 'F']
df = df[df.columns.intersection(fil_lst)]
print(df)
   A  D
0  1  4
1  6  9
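Another option, shown here as a sketch rather than as part of the original answer, is DataFrame.filter with the items argument, which keeps only the labels that actually exist on the axis. The name new_df is reused from the question.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], columns=list('ABCDE'))
fil_lst = ['A', 'D', 'F']

# Labels in fil_lst that are not columns of df ('F' here) are skipped
new_df = df.filter(items=fil_lst)

print(new_df)
```

Whether this is faster than the other approaches on a very wide dataframe has not been measured here; it mainly saves writing the membership test by hand.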

If you are dealing with large lists, and performance matters more than the order of the columns, you can use set intersection:
In [2944]: fil_lst = ['A', 'D', 'F']
In [2945]: col_list = df.columns.tolist()
In [2947]: df = df[list(set(col_list) & set(fil_lst))]
In [2947]: df
Out[2947]:
   D  A
0  4  1
1  9  6
EDIT: If the order of the columns is important, then do this:
In [2953]: df = df[sorted(set(col_list) & set(fil_lst), key = col_list.index)]
In [2953]: df
Out[2953]:
   A  D
0  1  4
1  6  9

Merge pandas dataframe with overwrite of columns

What is the quickest way to merge two Python data frames in this manner?
I have two data frames with similar structures (both have a primary key id and some value columns).
What I want to do is merge the two data frames based on id, with values from the second overwriting values from the first. Is there a way to do this with pandas operations? Here is how I've implemented it right now:
import pandas as pd
a = pd.DataFrame({'id': [1,2,3], 'letter': ['a', 'b', 'c']})
b = pd.DataFrame({'id': [1,3,4], 'letter': ['A', 'C', 'D']})
# Index the rows of each frame by id
a_dict = {e['id']: e for e in a.to_dict('records')}
b_dict = {e['id']: e for e in b.to_dict('records')}
# Start from a, then let rows from b overwrite matching ids
c_dict = a_dict.copy()
c_dict.update(b_dict)
c = pd.DataFrame(list(c_dict.values()))
Here, c would be equivalent to
pd.DataFrame({'id': [1,2,3,4], 'letter':['A','b', 'C', 'D']})
   id letter
0   1      A
1   2      b
2   3      C
3   4      D

combine_first
If 'id' is your primary key, then use it as your index.
b.set_index('id').combine_first(a.set_index('id')).reset_index()
   id letter
0   1      A
1   2      b
2   3      C
3   4      D
merge with groupby
a.merge(b, 'outer', 'id').groupby(lambda x: x.split('_')[0], axis=1).last()
   id letter
0   1      A
1   2      b
2   3      C
3   4      D
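Recent pandas releases emit a deprecation warning for groupby(..., axis=1), so here is a hedged sketch of the same merge idea without it. It assumes that values from b should win wherever both frames have a row for the same id; the suffixes '_a' and '_b' and the name m are chosen only for this example.

```python
import pandas as pd

a = pd.DataFrame({'id': [1, 2, 3], 'letter': ['a', 'b', 'c']})
b = pd.DataFrame({'id': [1, 3, 4], 'letter': ['A', 'C', 'D']})

# Outer merge keeps every id; the suffixes tell the two 'letter' columns apart
m = a.merge(b, on='id', how='outer', suffixes=('_a', '_b'))

# Prefer the value from b and fall back to a where b has no row for that id
m['letter'] = m['letter_b'].fillna(m['letter_a'])

c = m[['id', 'letter']].sort_values('id').reset_index(drop=True)
print(c)
```

With more value columns, the same fillna step would have to be repeated per column, which is where combine_first is more convenient.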

One way is as follows:
1. Concatenate dataframe a after dataframe b, so that rows from b come first
2. Drop duplicates based on id, which keeps the first occurrence (the row from b)
3. Sort the remaining rows by id
4. Reset the index and drop the old index
You can try:
import pandas as pd
a = pd.DataFrame({'id': [1,2,3], 'letter': ['a', 'b', 'c']})
b = pd.DataFrame({'id': [1,3,4], 'letter': ['A', 'C', 'D']})
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
c = pd.concat([b, a]).drop_duplicates(subset='id').sort_values('id').reset_index(drop=True)
print(c)

Try this:
c = pd.concat([a, b], axis=0).sort_values('letter').drop_duplicates('id', keep='first').sort_values('id')
c.reset_index(drop=True, inplace=True)
print(c)
   id letter
0   1      A
1   2      b
2   3      C
3   4      D
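The snippet above decides which row wins by sorting on 'letter', which works for this data because uppercase letters sort before lowercase ones. A sketch that does not depend on the values, assuming rows from b should always overwrite rows from a, could look like this:

```python
import pandas as pd

a = pd.DataFrame({'id': [1, 2, 3], 'letter': ['a', 'b', 'c']})
b = pd.DataFrame({'id': [1, 3, 4], 'letter': ['A', 'C', 'D']})

c = (
    pd.concat([a, b])                       # rows of b come after rows of a
      .drop_duplicates('id', keep='last')   # for a repeated id, keep b's row
      .sort_values('id')
      .reset_index(drop=True)
)
print(c)
```

Here the priority comes from the order of the frames passed to pd.concat, so swapping them would make a win instead.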

Sort or groupby dataframe in python using given string

I have the following dataframe:
           Id Direction Load  Unit
1  CN05059815   LoadFWD  0,0   NaN
2  CN05059815   LoadBWD  0,0   NaN
4  ....
....
and the following list:
list = ['CN05059830', 'CN05059946', 'CN05060010', 'CN05060064' ...]
I would like to sort or group the data according to the elements of the list.
For example, the new data should follow exactly the same order as the list. The first rows would start with CN05059815, which doesn't belong to the list, followed by CN05059830, CN05059946, ..., which both belong to the list, along with the rest of the data.

One way is to use Categorical Data. Here's a minimal example:
# sample dataframe
df = pd.DataFrame({'col': ['A', 'B', 'C', 'D', 'E', 'F']})
# required ordering
lst = ['D', 'E', 'A', 'B']
# convert to categorical
df['col'] = df['col'].astype('category')
# set order, adding values not in lst to the front
# (a list comprehension keeps this deterministic, unlike a set difference)
order = [c for c in df['col'].cat.categories if c not in lst] + lst
# attach ordering information to categorical series
df['col'] = df['col'].cat.reorder_categories(order)
# apply ordering
df = df.sort_values('col')
print(df)
  col
2   C
5   F
3   D
4   E
0   A
1   B

Consider the approach below, with an example:
df = pd.DataFrame({
    'col': ['a', 'b', 'c', 'd', 'e']
})
list_ = ['d', 'b', 'a']
print(df)
Output:
  col
0   a
1   b
2   c
3   d
4   e
Then, to sort df according to the list and its ordering (values not in the list get -1 and therefore come first; mergesort is stable, so they keep their original relative order):
df.reindex(df['col'].apply(lambda x: list_.index(x) if x in list_ else -1).sort_values(kind='mergesort').index)
Output:
  col
2   c
4   e
3   d
1   b
0   a
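The same ordering can also be sketched with the key argument of sort_values, which is available from pandas 1.1 onward. The dictionary rank and the name result are introduced only for this example, and placing unlisted values first by giving them -1 mirrors the approach above.

```python
import pandas as pd

df = pd.DataFrame({'col': ['a', 'b', 'c', 'd', 'e']})
list_ = ['d', 'b', 'a']

# Map each listed value to its position; unlisted values get -1 so they sort first
rank = {value: pos for pos, value in enumerate(list_)}

result = df.sort_values(
    'col',
    key=lambda s: s.map(rank).fillna(-1),
    kind='mergesort',  # stable, so unlisted rows keep their original relative order
)
print(result)
```

A dictionary lookup avoids calling list_.index for every row, which may matter when the list is long.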
