Getting a subset of arrays from a pandas data frame - Python

I have a numpy array named arr with 1154 elements in it.
array([502, 502, 503, ..., 853, 853, 853], dtype=int64)
I have a data frame called df
team Count
0 512 11
1 513 21
2 515 18
3 516 8
4 517 4
How do I get the subset of the data frame df that only includes team values present in the array arr?
For example:
team count
arr1_value1 45
arr1_value2 67
To make this question more clear:
I have a numpy array ['45', '55', '65']
I have a data frame as follows:
team count
34 156
45 189
53 90
65 99
23 77
55 91
I need a new data frame as follows:
team count
45 189
55 91
65 99

I don't know whether it is a typo that your array values look like strings; assuming it is, and they are in fact ints, then you can filter your df by calling isin:
In [6]:
a = np.array([45, 55, 65])
df[df.team.isin(a)]
Out[6]:
team count
1 45 189
3 65 99
5 55 91

You can use the DataFrame.loc method
Using your example (Notice that team is the index):
arr = np.array(['45', '55', '65'])
frame = pd.DataFrame([156, 189, 90, 99, 77, 91], index=['34', '45', '53', '65', '23', '55'])
ans = frame.loc[arr]
This sort of indexing is type sensitive, so if the frame.index is int then make sure your indexing array is also of type int, and not str like in this example.
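For instance, the same example with an int index, casting the array before indexing (a small sketch):
import numpy as np
import pandas as pd

frame = pd.DataFrame([156, 189, 90, 99, 77, 91],
                     index=[34, 45, 53, 65, 23, 55])  # int index this time
arr = np.array(['45', '55', '65'])                    # str labels would raise a KeyError here
ans = frame.loc[arr.astype(int)]                      # cast to int so the labels match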

I am answering the question asked after "To make this question more clear".
As a side note: the first 4 lines of setup code could have been provided by you, so I would not have to type them myself, which can also introduce errors or misunderstandings.
The idea is to create a boolean Series and then simply index the dataframe with it. I just started with pandas; maybe this can be done more efficiently.
import numpy as np
import pandas as pd

# starting with the df and teams as strings
df = pd.DataFrame(data={'team': [34, 45, 53, 65, 23, 55], 'count': [156, 189, 90, 99, 77, 91]})
teams = np.array(['45', '55', '65'])

# we want the team numbers as ints
teams_int = [int(t) for t in teams]

# mini function to check if the team is to be kept
def filter_teams(x):
    return x in teams_int

# create the boolean series and only keep those rows from our original df
index = df['team'].apply(filter_teams)
df_filtered = df[index]
It returns this dataframe:
count team
1 189 45
3 99 65
5 91 55
Note that in this case, df_filtered uses 1, 3, 5 as index (the indices of the original dataframe). Your question is unclear about this, as the index is not shown to us.


Pandas Dataframe - How to get a multiline cell separated by carriage return into multiple rows?

Thank you for taking the time to look into this. I'm a beginner programmer and stuck on this.
# the dataframe is as follows, for reference
data = [['\r8', 'tom', 10, '55\r \r \r62\r75'], ['18\r\r9', 'nick', 15, '77\r25\r85'], ['17\r19\r18', 'juli', 14, '55\r75\r85']]
df = pd.DataFrame(data, columns=['Roll No per Class', 'Name', 'Age', 'Highest Scores'])
This is a sample dataframe, the original one spans over more than 15,000 rows and 10 columns.
I want the \r-separated values to be placed into new rows, with the other columns repeating.
I have tried the code mentioned below
import numpy as np
from itertools import chain

# return list from series of separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split('\r')))

# calculate lengths of splits
lens = df['Highest Scores'].str.split(',').map(len)

# create new dataframe, repeating or chaining as appropriate
res = pd.DataFrame({'Name': np.repeat(df['Name'], lens),
                    'Age': np.repeat(df['Age'], lens),
                    'Roll No per Class': chainer(df['Roll No per Class']),
                    'Highest Scores': chainer(df['Highest Scores'])})
I'm getting the error:
ValueError: All arrays must be of the same length
I have also tried the code -
df.set_index(['Name', 'Age']).apply(lambda x: x.str.split('\r').explode()).reset_index()
It also gives an error :
ValueError: cannot handle a non-unique multi-index!
I'm guessing this is because the length of Roll number column doesn't match the length of Highest Scores column.
Can someone please help look into this? This is my first post, so do let me know if anything is missing and needs to be added.
The ValueError in your first attempt most likely comes from computing lens with str.split(',') while chainer splits on '\r', so np.repeat and chain produce arrays of different lengths. A cleaner route is to split the cells at \r first,
>>> cols = ['Roll No per Class', 'Highest Scores']
>>> df[cols] = df[cols].apply(lambda col: col.str.split("\r"))
>>> df
Roll No per Class Name Age Highest Scores
0 [, , 8] tom 10 [55, 62, 75]
1 [18, , 9] nick 15 [77, 25, 85]
2 [17, 19, 18] juli 14 [55, 75, 85]
and explode them after:
>>> df.explode(cols)
Roll No per Class Name Age Highest Scores
0 tom 10 55
0 tom 10 62
0 8 tom 10 75
1 18 nick 15 77
1 nick 15 25
1 9 nick 15 85
2 17 juli 14 55
2 19 juli 14 75
2 18 juli 14 85
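Note that passing a list of columns to df.explode requires pandas 1.3 or newer. On older versions, one fallback (a sketch in the spirit of the original np.repeat/chain attempt, assuming the split lists in each row have equal lengths) is:
import numpy as np

# number of fragments per row, taken from the already-split lists
lens = df['Highest Scores'].str.len()
res = pd.DataFrame({
    'Roll No per Class': np.concatenate(df['Roll No per Class'].to_numpy()),
    'Name': np.repeat(df['Name'].to_numpy(), lens),
    'Age': np.repeat(df['Age'].to_numpy(), lens),
    'Highest Scores': np.concatenate(df['Highest Scores'].to_numpy()),
})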

Dynamically creating nested lists of sequential numbers

I have a dataframe which is a subset of another dataframe and contains the following indexes: 45, 46, 47, 51, 52
Example dataframe:
price count
45 3909.0 8
46 3908.75 8
47 3908.50 8
51 3907.75 8
52 3907.5 8
I want to make 2 lists, each containing one run of the sequential indexes, in this format:
list[0] = [45, 46, 47]
list[1] = [51, 52]
Problem: The following code causes this error on the second to last line:
IndexError: list assignment index out of range
same_width_nodes = df.loc[df['count'] == width]
i = same_width_nodes.index[0]
seq = 0
sequences = [[]]
sequences[seq] = []
for index, row in same_width_nodes.iterrows():
    if i == index:
        i += 1
        sequences[seq].append(index)
    else:
        seq += 1
        sequences[seq] = [index]
        i = index
Maybe there's a better way to achieve this, but I'd like to know why I can't create a new item in the sequences list as I am doing here, and how I should be doing it.
You can use this:
s_index=df.index.to_series()
l = s_index.groupby(s_index.diff().ne(1).cumsum()).agg(list).to_numpy()
Output:
l[0]
[45, 46, 47]
and
l[1]
[51, 52]
In steps:
First we take a rolling diff of the index; any step that is not equal to 1 is coded as True, and a cumulative sum over those flags then assigns one group label per consecutive run:
45 1
46 1
47 1
51 2
52 2
Next, we group by these labels and aggregate each group's index values into a list with agg(list).
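To see each intermediate step on the example index (a quick sketch, same frame as above):
import pandas as pd

df = pd.DataFrame({'price': [3909.0, 3908.75, 3908.50, 3907.75, 3907.5],
                   'count': [8, 8, 8, 8, 8]},
                  index=[45, 46, 47, 51, 52])
s_index = df.index.to_series()
print(s_index.diff())                 # NaN, 1, 1, 4, 1 -- the gap between 47 and 51 shows up as 4
print(s_index.diff().ne(1))           # True, False, False, True, False
print(s_index.diff().ne(1).cumsum())  # 1, 1, 1, 2, 2 -- one label per consecutive run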
Setup.
df = pd.DataFrame([1, 2, 3, 4, 5], columns=['A'], index=[45, 46, 47, 51, 52])
A
45 1
46 2
47 3
51 4
52 5
df['grp'] = df.assign(idx=df.index)['idx'].diff().fillna(1).ne(1).cumsum()
idx = [i.index.tolist() for _,i in df.groupby('grp')]
[[45, 46, 47], [51, 52]]
The issue is with this line
sequences[seq] = [index]
You are trying to assign to a position in the list that has not been created yet. Instead do this:
sequences.append([index])
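Putting that fix into the loop, together with one more correction: after starting a new run, i has to be advanced past the gap (to index + 1), otherwise the element right after a gap would open yet another run. A sketch using the same variables as the question:
same_width_nodes = df.loc[df['count'] == width]
sequences = [[]]
i = same_width_nodes.index[0]
for index in same_width_nodes.index:
    if index == i:
        sequences[-1].append(index)  # still consecutive: extend the current run
    else:
        sequences.append([index])    # gap found: start a new run
    i = index + 1                    # the next index we expect to see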
I use diff to find where the index value changes by more than 1, then iterate the rows as tuples and access their values by position.
index = [45, 46, 47, 51, 52]
price = [3909.0, 3908.75, 3908.50, 3907.75, 3907.5]
count = [8, 8, 8, 8, 8]
df = pd.DataFrame({'index': index, 'price': price, 'count': count})
df['diff'] = df['index'].diff().fillna(0)
print(df)

result_list = [[]]
seq = 0
for row in df.itertuples():
    index = row[1]
    diff = row[4]
    if diff <= 1:
        result_list[seq].append(index)
    else:
        seq += 1
        result_list.append([index])  # append keeps result_list[seq] pointing at the new run
print(result_list)
print(result_list)
output:
[[45, 46, 47], [51, 52]]

Want to create a sparse matrix like dataframe from a dataframe in pandas/python

I have a data frame like this
I want to convert it to something like this: note that ds is the day someone visited and will have values from 0 to 31; for the days not visited it will show 0, and for the days visited it will show 1. It's kind of like a sparse matrix. Can someone help?
Adding to the solution from @sim. By using the columns parameter, one can avoid the join.
The sparse=True parameter returns sparse columns; sparse=False returns dense ones.
header = ["ds", "buyer_id", "email_address"]
data = [[23, 305, "fatin1bd@gmail.com"],
        [22, 307, "shovonbad@gmail.com"],
        [25, 411, "raisulk@gmail.com"],
        [22, 588, "saiful.sdp@hotmail.com"],
        [24, 664, "osman.dhk@gmail.com"]]
df = pd.DataFrame(data, columns=header)
df = pd.get_dummies(df, columns=['ds'], sparse=True)
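To sanity-check the result (a quick probe; the exact sparse dtype depends on your pandas version):
print(df.columns.tolist())
# ['buyer_id', 'email_address', 'ds_22', 'ds_23', 'ds_24', 'ds_25']
print(df.dtypes['ds_22'])  # a Sparse dtype, e.g. Sparse[uint8, 0]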
If you use sparse=True, the result can be converted back to dense using sparse.to_dense() on the specific columns. For more details refer to the pandas User Guide.
ds_cols = [col for col in df.columns if col.startswith('ds_')]
df = pd.concat([df[['buyer_id', 'email_address']],
                df[ds_cols].sparse.to_dense()], axis=1)
Update: pd.get_dummies now accepts sparse=True to create a SparseArray output.
pd.get_dummies(s: pd.Series) can be used to create a one-hot encoding like so:
header = ["ds", "buyer_id", "email_address"]
data = [[23, 305, "fatin1bd@gmail.com"],
        [22, 307, "shovonbad@gmail.com"],
        [25, 411, "raisulk@gmail.com"],
        [22, 588, "saiful.sdp@hotmail.com"],
        [24, 664, "osman.dhk@gmail.com"]]
df = pd.DataFrame(data, columns=header)
df.join(pd.get_dummies(df["ds"]))
output:
ds buyer_id email_address 22 23 24 25
0 23 305 fatin1bd@gmail.com 0 1 0 0
1 22 307 shovonbad@gmail.com 1 0 0 0
2 25 411 raisulk@gmail.com 0 0 0 1
3 22 588 saiful.sdp@hotmail.com 1 0 0 0
4 24 664 osman.dhk@gmail.com 0 0 1 0
Just for added clarification: The resulting dataframe is still stored in a dense format. You could use scipy.sparse matrix formats to store it in a true sparse format.
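For example, to hand the one-hot columns to scipy as a true sparse matrix (a sketch, assuming scipy is installed):
from scipy import sparse

dummies = pd.get_dummies(df['ds'], prefix='ds', sparse=True)
coo = dummies.sparse.to_coo()  # scipy.sparse.coo_matrix, no dense intermediate
csr = sparse.csr_matrix(coo)   # CSR is convenient for row slicing
print(csr.shape, csr.nnz)      # (5, 4) with 5 stored ones for the sample data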

Splitting a DataFrame into an Array Using Numpy

I have a file called data that looks like this:
Some Text Information (lines 1-6 in file)
1 22 23
2 44 44
3 55 55
4 66 66
5 77 77
What I'm trying to achieve is something like this:
[[ 22. 23.]
[ 44. 44.]
[ 55. 55.]
[ 66. 66.]
[ 77. 77.]]
The issue I'm having is that the code I'm using doesn't properly split the data from the file. It ends up looking like this:
[ 1 22 23
0 2 44 44
1 3 55 55, Empty DataFrame
Columns: [1 6734 1453]
Index: [], 1 22 23
2 4 44 44
3 5 55 55
4 6 66 66
5 7 77 77
EOF]
Here's the code I'm using:
def loadFile(filename):
    df1 = pd.read_fwf(filename, skiprows=6)
    df1 = np.split(df, [2,2])
    print('The data points:\n {}'.format(df1[:5]))
I understand the parameters of the split function. For instance, [2,2] should create two sub arrays from my dataframe and my axis is 0. However, why does it not properly split the array?
You can read the file into a pandas DataFrame and access its values attribute. Assuming "Some Text Information" is not the header:
import pandas as pd
df = pd.read_table(filepath, sep='\t', index_col=0, skiprows=6, header=None)
df.values  # gives you the numpy ndarray
This uses the first column as the index. You might need to remove the sep argument to let read_table infer it, or try other separators. If the row index ends up in your data, slice it off to get the desired result. Use something like:
df.iloc[:,1:].values
Do not use read_fwf, let pandas figure out the structure of your table:
df = pd.read_csv("yourfile", skiprows=6, header=None, sep='\s+')
To elaborate on ManKind_008's answer:
Your explicit line numbers are the problem. Pandas interprets these as valid data.
Using ManKind_008's solution does properly set the index column, but your line numbers then remain as the index, so you end up with a DataFrame like:
pd.read_fwf('test.csv', header=None, index_col=0, skiprows=6)
1 2
0
1 22 23
2 44 44
3 55 55
4 66 66
5 77 77
Instead I suggest you read in all of your data using:
pd.read_fwf('test.csv', header=None, skiprows=6).iloc[:, 1:]
1 2
0 22 23
1 44 44
2 55 55
3 66 66
4 77 77
This leaves you with what you seem to need. The iloc call drops the first column of data (your line numbers).
From here the df.values command will give you:
array([[22, 23],
[44, 44],
[55, 55],
[66, 66],
[77, 77]])
If you don't want a np.array, you can convert it to a nested list with the .tolist() method.

PANDAS: Extracting values from a column by applying a condition on other columns

I have the following table
Label B C
1 5 91
1 5 65
1 5 93
-1 5 54
-1 5 48
1 10 66
1 10 54
-1 10 15
I want only those values in C which are labeled 1, for each set of values in B. I want to extract those values from C into a list like this:
[[91, 65, 93], [66, 54]]
Implementing this in plain Python is easy, but I want to do the same thing using pandas.
You can filter the df to just those values where Label is 1, then on the remaining columns groupby B and get the unique values of C:
In [26]:
gp = df[df['Label']==1][['B','C']].groupby('B')
gp['C'].unique()
Out[26]:
B
5 [91, 65, 93]
10 [66, 54]
Name: C, dtype: object
You can convert it to a list of arrays also:
In [36]:
list(gp['C'].unique().values)
Out[36]:
[array([91, 65, 93], dtype=int64), array([66, 54], dtype=int64)]
You can group by the Label column and apply the list constructor. Here is a minimal example.
Label = [1, 1, 1, -1, -1, -1]
c = [91, 65, 93, 54, 48, 15]
df = pd.DataFrame({'Label': Label, 'c': c})
df['c'].groupby(df['Label']).apply(list)[1] # Change 1 to -1 if you want the -1 group
If you only want unique entries, then you can do
df['c'].groupby(df['Label']).unique()[1]
Not as nice as the other answers.
First select the label and the useful columns:
df2 = df[df['Label'] == 1][['B','C']].set_index('B')
Then just a list comprehension to get the values:
print([list(df2.loc[index]['C']) for index in set(df2.index)])
you get:
[[66, 54], [91, 65, 93]]
