I have a dataframe which is a subset of another dataframe and contains the following indexes: 45, 46, 47, 51, 52
Example dataframe:
price count
45 3909.0 8
46 3908.75 8
47 3908.50 8
51 3907.75 8
52 3907.5 8
I want to make two lists, each containing one run of consecutive indexes. (Example of this data format:)
list[0] = [45, 46, 47]
list[1] = [51, 52]
Problem: The following code causes this error on the second to last line:
IndexError: list assignment index out of range
same_width_nodes = df.loc[df['count'] == width]
i = same_width_nodes.index[0]
seq = 0
sequences = [[]]
sequences[seq] = []
for index, row in same_width_nodes.iterrows():
    if i == index:
        i += 1
        sequences[seq].append(index)
    else:
        seq += 1
        sequences[seq] = [index]
        i = index
Maybe there's a better way to achieve this, but I'd like to know why I can't create a new item in the sequences list as I am doing here, and how I should be doing it.
You can use this:
s_index=df.index.to_series()
l = s_index.groupby(s_index.diff().ne(1).cumsum()).agg(list).to_numpy()
Output:
l[0]
[45, 46, 47]
and
l[1]
[51, 52]
In steps:
First we take a diff on your index; any step that is not equal to 1 (i.e. a gap) is coded as True, and a cumsum over those flags then assigns a new group number per sequence:
45 0
46 0
47 0
51 1
52 1
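To reproduce that intermediate series yourself, here is a minimal sketch (the frame construction is my own assumption; with .ne(1) the labels come out as 1 and 2 rather than 0 and 1, but the grouping is identical):
import pandas as pd

# hypothetical reconstruction of the question's frame
df = pd.DataFrame({'price': [3909.0, 3908.75, 3908.50, 3907.75, 3907.5],
                   'count': [8, 8, 8, 8, 8]},
                  index=[45, 46, 47, 51, 52])
s_index = df.index.to_series()
# diff() is NaN on the first row and 4 at the gap (51 - 47);
# ne(1) flags both as True, and cumsum turns the flags into group labels
print(s_index.diff().ne(1).cumsum().tolist())  # [1, 1, 1, 2, 2]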
Next, we group the index by those new sequence labels and aggregate each group into a list; the variant below builds the same nested list with a list comprehension over groupby.
Setup.
df = pd.DataFrame([1, 2, 3, 4, 5], columns=['A'], index=[45, 46, 47, 51, 52])
A
45 1
46 2
47 3
51 4
52 5
df['grp'] = df.assign(idx=df.index)['idx'].diff().fillna(1).ne(1).cumsum()
idx = [i.index.tolist() for _,i in df.groupby('grp')]
[[45, 46, 47], [51, 52]]
The issue is with this line:
sequences[seq] = [index]
You are trying to assign to a list index that does not exist yet. Append a new list instead:
sequences.append([index])
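Applied to the loop from the question, a minimal corrected sketch (note that i must also advance past an index that starts a new run, otherwise the very next consecutive row would open yet another group):
sequences = [[]]
i = same_width_nodes.index[0]
for index, row in same_width_nodes.iterrows():
    if i == index:
        sequences[-1].append(index)  # still consecutive: extend the current run
    else:
        sequences.append([index])    # gap: append a brand-new run
    i = index + 1                    # the next consecutive index we expect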
I use diff to find where the index jumps by more than 1, then iterate the rows as tuples and access their values by position.
index = [45, 46, 47, 51, 52]
price = [3909.0, 3908.75, 3908.50, 3907.75, 3907.5]
count = [8, 8, 8, 8, 8]
df = pd.DataFrame({'index': index, 'price': price, 'count': count})
df['diff'] = df['index'].diff().fillna(0)
print(df)
result_list = [[]]
seq = 0
for row in df.itertuples():
    index = row[1]
    diff = row[4]
    if diff <= 1:
        result_list[seq].append(index)
    else:
        seq += 1
        result_list.append([index])  # append a new run; insert(1, ...) only works for a single gap
print(result_list)
output:
[[45, 46, 47], [51, 52]]
After this discussion, I have the following dataframe:
data = {'Item': ['1', '2', '3', '4', '5'],
        'Len': [142, 11, 50, 60, 12],
        'Hei': [55, 65, 130, 14, 69],
        'C': [68, -18, 65, 16, 17],
        'Thick': [60, 0, -150, 170, 130],
        'Vol': [230, 200, -500, 10, 160],
        'Fail': [['Len', 'Thick'], ['Thick'], ['Hei', 'Thick', 'Vol'], ['Vol'], ""]}
df = pd.DataFrame(data)
representing different items and the corresponding values of some of their parameters (Len, Hei, C, ...). The column Fail reports the parameters that failed, e.g. item 1 fails for parameters Len and Thick, item 3 fails for parameters Hei, Thick and Vol, while item 5 shows no failure.
For each item I need a new column reporting each failed parameter together with its value, in the following format: failed parameter = value. So, for the first item I should get Len=142 and Thick=60.
So far, I have exploded the Fail column into multiple columns:
failed_param = df['Fail'].apply(pd.Series)
failed_param = failed_param.rename(columns=lambda x: 'Failed_param_' + str(x + 1))
df2_list = failed_param.columns.values.tolist()
df2 = pd.concat([df, failed_param], axis=1)
Then, if I do the following:
for name in df2_list:
    df2.loc[df2[f"{name}"] == "D", "new"] = "D" + "=" + df2["D"].map(str)
I can get what I need, but only for one parameter (D in this case). How can I obtain the same for all the parameters at once?
As mentioned in the question, you need to insert a new column (e.g., FailParams) that contains one string per item, each listing that item's failures (e.g., Len=142,Thick=60). A quick solution can be:
import pandas as pd
data = {
    'Item' : ['1', '2', '3', '4', '5'],
    'Len'  : [142, 11, 50, 60, 12],
    'Hei'  : [55, 65, 130, 14, 69],
    'C'    : [68, -18, 65, 16, 17],
    'Thick': [60, 0, -150, 170, 130],
    'Vol'  : [230, 200, -500, 10, 160],
    'Fail' : [['Len', 'Thick'], ['Thick'], ['Hei', 'Thick', 'Vol'], ['Vol'], []]
}
# Convert the dictionary into a DataFrame.
df = pd.DataFrame(data)
# The first solution: using list comprehension.
column = [
    ",".join(  # Add commas between the list items.
        # Find the target items and their values.
        [el + "=" + str(df.loc[int(L[0]) - 1, el]) for el in L[1]]
    )
    if (len(L[1]) > 0) else ""  # If the Fail entry is empty, return an empty string.
    for L in zip(df['Item'].values, df['Fail'].values)  # Loop over the (Item, Fail) pairs.
]
# Insert the new column.
df['FailParams'] = column
# Print the DF after insertion.
print(df)
The previous solution uses a list comprehension. Another solution, using loops:
# The second solution: using loops.
records = []
for L in zip(df['Item'].values, df['Fail'].values):
    if (len(L[1]) <= 0):
        record = ""
    else:
        record = ",".join([el + "=" + str(df.loc[int(L[0]) - 1, el]) for el in L[1]])
    records.append(record)
print(records)
# Insert the new column.
df['FailParams'] = records
# Print the DF after insertion.
print(df)
A sample output should be:
Item Len Hei C Thick Vol Fail FailParams
0 1 142 55 68 60 230 [Len, Thick] Len=142,Thick=60
1 2 11 65 -18 0 200 [Thick] Thick=0
2 3 50 130 65 -150 -500 [Hei, Thick, Vol] Hei=130,Thick=-150,Vol=-500
3 4 60 14 16 170 10 [Vol] Vol=10
4 5 12 69 17 130 160 []
It might be a good idea to build an intermediate representation first, something like this (I am assuming the empty cell in the Fail column is an empty list [] so as to match the datatype of the other values):
# create a Boolean mask to filter failed values
m = df.apply(lambda row: row.index.isin(row.Fail),
             axis=1,
             result_type='broadcast')
>>> df[m]
Item Len Hei C Thick Vol Fail
0 NaN 142.0 NaN NaN 60.0 NaN NaN
1 NaN NaN NaN NaN 0.0 NaN NaN
2 NaN NaN 130.0 NaN -150.0 -500.0 NaN
3 NaN NaN NaN NaN NaN 10.0 NaN
4 NaN NaN NaN NaN NaN NaN NaN
This allows you to actually do something with the failed values, too.
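For example, a small thing the mask enables right away is counting how many items failed each parameter (the counts below assume the sample data above):
>>> df[m].notna().sum()
Item     0
Len      1
Hei      1
C        0
Thick    3
Vol      2
Fail     0
dtype: int64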
With that in place, generating the value list could be done by something similar to Hossam Magdy Balaha's answer, perhaps with a little function:
def join_params(row):
    row = row.dropna().to_dict()
    return ', '.join(f'{k}={v}' for k, v in row.items())
>>> df[m].apply(join_params, axis=1)
0 Len=142.0, Thick=60.0
1 Thick=0.0
2 Hei=130.0, Thick=-150.0, Vol=-500.0
3 Vol=10.0
4
dtype: object
I would like to convert the first 50 items of a large pandas dataframe into a list, such that for each index the list holds the value at that index. If the dataframe has no value at an index, the list should hold 0 there.
For example the pandas dataframe which looks like this:
ID Count
0 20
1 50
2 60
4 90
5 20
.
49 65
.
9999999 60054
would be converted to the following list, with only the first 50 elements of the dataframe being relevant:
[20, 50, 60, 0, 90, 20, ..., 65]
Note that at index=3, the value in the list is 0, because the ID was not found in the pandas dataframe.
If I understand correctly:
mylist = (df.iloc[:50].set_index('ID')
.reindex(range(50), fill_value=0)['Count']
.tolist())
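A quick sanity check on a toy version of the data (the four-row frame is my own reconstruction, cut down to range(5) instead of range(50)):
df = pd.DataFrame({'ID': [0, 1, 2, 4], 'Count': [20, 50, 60, 90]})
mylist = (df.set_index('ID')
            .reindex(range(5), fill_value=0)['Count']
            .tolist())
# [20, 50, 60, 0, 90]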
IIUC:
d = df.query('ID < 5')           # keep only the IDs below the cutoff
m = dict(zip(*map(d.get, d)))    # build an {ID: Count} mapping from the two columns
[m.get(i, 0) for i in range(5)]  # fill the missing IDs with 0
[20, 50, 60, 0, 90]
I have a numpy array named arr with 1154 elements in it.
array([502, 502, 503, ..., 853, 853, 853], dtype=int64)
I have a data frame called df
team Count
0 512 11
1 513 21
2 515 18
3 516 8
4 517 4
How do I get the subset of the data frame df that includes only the values from the array arr?
For example:
team count
arr1_value1 45
arr1_value2 67
To make this question more clear:
I have a numpy array ['45', '55', '65']
I have a data frame as follows:
team count
34 156
45 189
53 90
65 99
23 77
55 91
I need a new data frame as follows:
team count
45 189
55 91
65 99
I don't know if it is a typo that your array values look like strings; assuming it is not and they are in fact ints, you can filter your df by calling isin:
In [6]:
a = np.array([45, 55, 65])
df[df.team.isin(a)]
Out[6]:
team count
1 45 189
3 65 99
5 55 91
You can use the DataFrame.loc method
Using your example (Notice that team is the index):
arr = np.array(['45', '55', '65'])
frame = pd.DataFrame([156, 189, 90, 99, 77, 91], index=['34', '45', '53', '65', '23', '55'])
ans = frame.loc[arr]
This sort of indexing is type sensitive, so if the frame.index is int then make sure your indexing array is also of type int, and not str like in this example.
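For instance, a hedged sketch of that conversion, assuming the real frame has an int index:
arr = np.array(['45', '55', '65'])
frame = pd.DataFrame([156, 189, 90, 99, 77, 91],
                     index=[34, 45, 53, 65, 23, 55])  # int index this time
ans = frame.loc[arr.astype(int)]  # cast the labels to match the index dtype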
I am answering the question asked after "To make this question more clear".
As a side note: the first four lines could have been provided by you, so I would not have to type them myself, which could also introduce errors or misunderstandings.
The idea is to create a Series as Index and then simply create a new dataframe based on that index. I just started with pandas, maybe this can be done more efficiently.
import numpy as np
import pandas as pd
# starting with the df and teams as string
df = pd.DataFrame(data={'team': [34, 45, 53, 65, 23, 55], 'count': [156, 189, 90, 99, 77, 91]})
teams = np.array(['45', '55', '65'])
# we want the team number as int
teams_int = [int(t) for t in teams]
# mini function to check if the team is to be kept
def filter_teams(x):
    return x in teams_int

# create the series as a boolean mask and keep only those rows from our original df
index = df['team'].apply(filter_teams)
df_filtered = df[index]
It returns this dataframe:
count team
1 189 45
3 99 65
5 91 55
Note that in this case, df_filtered uses 1, 3, 5 as index (the indices of the original dataframe). Your question is unclear about this, as the index is not shown to us.
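If a fresh 0-based index is wanted instead, one extra call should do it (a small sketch, not part of the original answer):
df_filtered = df[index].reset_index(drop=True)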
I have the following table
Label B C
1 5 91
1 5 65
1 5 93
-1 5 54
-1 5 48
1 10 66
1 10 54
-1 10 15
I want only those values in C which are labeled '1', for each set of values in B. I want to extract those values from C into a nested list like this:
[[91, 65, 93], [66, 54]]
Implementing a similar thing in plain Python is easy, but I want to do the same thing using pandas.
You can filter the df to just those values where Label is 1, then on the remaining columns groupby B and get the unique values of C:
In [26]:
gp = df[df['Label']==1][['B','C']].groupby('B')
gp['C'].unique()
Out[26]:
B
5 [91, 65, 93]
10 [66, 54]
Name: C, dtype: object
You can convert it to a list of arrays also:
In [36]:
list(gp['C'].unique().values)
Out[36]:
[array([91, 65, 93], dtype=int64), array([66, 54], dtype=int64)]
You can group by the Label column and apply the list constructor. Here is a minimal example.
Label = [1, 1, 1, -1, -1, -1]
c = [91, 65, 93, 54, 48, 15]
df = pd.DataFrame({'Label': Label, 'c': c})
df['c'].groupby(df['Label']).apply(list)[1] # Change 1 to -1 if you want the -1 group
If you only want unique entries, then you can do
df['c'].groupby(df['Label']).unique()[1]
Not as nice as the other answers.
First select the label and keep the useful columns:
df2 = df[df['Label'] == 1][['B','C']].set_index('B')
Then just a list comprehension to get the values:
# .ix is deprecated in modern pandas; use .loc instead
print([list(df2.loc[index]['C']) for index in set(df2.index)])
you get:
[[66, 54], [91, 65, 93]]
This is the format of my data:
Date hits returning
2014/02/06 10 0
2014/02/06 25 0
2014/02/07 11 0
2014/02/07 31 1
2014/02/07 3 2
2014/02/08 6 0
2014/02/08 4 3
2014/02/08 17 0
2014/02/08 1 0
2014/02/09 6 0
2014/02/09 8 1
The required output is:
date, sum_hits, sum_returning, sum_total
2014/02/06 35 0 35
2014/02/07 44 3 47
2014/02/08 28 3 31
2014/02/09 14 1 15
The output is for use with Google Charts.
For getting the unique dates and summing the values per row, I am creating a dictionary and using the date as the key, something like:
# hits = <object with the input data>
data = {}
for h in hits:
    day = h.day_hour.strftime('%Y/%m/%d')
    if day in data:
        t_hits = int(data[day][0] + h.hits)
        t_returning = int(data[day][1] + h.returning)
        data[day] = [t_hits, t_returning, t_hits + t_returning]
    else:
        data[day] = [
            h.hits,
            h.returning,
            int(h.hits + h.returning)]
This creates something like:
{
    '2014/02/06': [35, 0, 35],
    '2014/02/07': [44, 3, 47],
    '2014/02/08': [28, 3, 31],
    '2014/02/09': [14, 1, 15]
}
And for creating the required output I am doing this:
array = []
for k, v in data.items():
    row = [k]
    row.extend(v)
    array.append(row)
which creates an array with the required format:
[
    ['2014/02/06', 35, 0, 35],
    ['2014/02/07', 44, 3, 47],
    ['2014/02/08', 28, 3, 31],
    ['2014/02/09', 14, 1, 15],
]
So my question basically is: is there a better way of doing this, or some Python built-in that could let me group rows by a field while summing the row values?
If your input is always sorted (or if you can sort it), you can use itertools.groupby to simplify some of this. groupby, as the name suggests, groups the input elements by key, and gives you pairs of (group_key, iterator_over_the_values_in_that_group). Something like the following should work:
import itertools

# the keyfunc extracts the grouping key from each input element
keyfunc = lambda row: row.day_hour.strftime("%Y/%m/%d")

data = []
for day, day_rows in itertools.groupby(hits, key=keyfunc):
    sum_hits = 0
    sum_returning = 0
    for row in day_rows:
        sum_hits += int(row.hits)
        sum_returning += int(row.returning)
    data.append([day, sum_hits, sum_returning, sum_hits + sum_returning])
# data now contains your desired output
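If pandas is available (as in the rest of this thread), the same aggregation collapses to a groupby. A minimal sketch, assuming hits can be loaded into a DataFrame with date, hits and returning columns (the three-row frame is made up for illustration):
import pandas as pd

df = pd.DataFrame({'date': ['2014/02/06', '2014/02/06', '2014/02/07'],
                   'hits': [10, 25, 11],
                   'returning': [0, 0, 0]})
out = df.groupby('date', as_index=False)[['hits', 'returning']].sum()
out['total'] = out['hits'] + out['returning']
data = out.values.tolist()
# [['2014/02/06', 35, 0, 35], ['2014/02/07', 11, 0, 11]]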