remove values from pandas df and move remaining upwards - python

I have a dataframe with categorical data in it.
I have come up with a procedure that keeps only the desired categories, moving the remaining values up into the cells left empty by the deleted ones.
But I want to do it without the list intermediaries if possible.
import pandas as pd
mydf = pd.DataFrame(data = {'a': [9,6,3,8,5],
'b': [4, 3,5,6,7],
'c': [5, 3,6,9,10]
}
)
selecList = [5,8,4,6] # only these categories shall remain
mydf
a b c
0 9 4 5
1 6 3 3
2 3 5 6
3 8 6 9
4 5 7 10
Desired Output
a b c
0 6 4 5
1 8 5 6
2 5 6 <NA>
My workaround:
myList = mydf.T.values.tolist()
myList
[[9, 6, 3, 8, 5], [4, 3, 5, 6, 7], [5, 3, 6, 9, 10]]
filtered_list = [[x for x in y if x in selecList] for y in myList]
filtered_list
[[6, 8, 5], [4, 5, 6], [5, 6]]
filtered_df = pd.DataFrame(filtered_list).T
filtered_df.columns = list(mydf)
filtered_df = filtered_df.astype('Int64')
Unsuccessful try:
pd.DataFrame(mydf.apply(lambda y: [x for x in y if x in selecList ])).T

Here is an alternative solution, dropping the NaNs per column and re-indexing so the kept values move up:
mydf.where(mydf.isin(selecList)).apply(lambda s: s.dropna().reset_index(drop=True))
Here is another solution, stacking the kept values and re-spreading them with a per-column counter:
(mydf.where(mydf.isin(selecList))
     .stack()
     .droplevel(0)
     .to_frame()
     .assign(i=lambda x: x.groupby(level=0).cumcount())
     .set_index('i', append=True)[0]
     .unstack(level=0))
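As a runnable end-to-end check of the desired compaction (a sketch based on the per-column dropna idea; variable names follow the question):

```python
import pandas as pd

mydf = pd.DataFrame({'a': [9, 6, 3, 8, 5],
                     'b': [4, 3, 5, 6, 7],
                     'c': [5, 3, 6, 9, 10]})
selecList = [5, 8, 4, 6]  # only these categories shall remain

# Mask values outside selecList, compact each column upwards,
# and restore a nullable integer dtype.
out = (mydf.where(mydf.isin(selecList))
           .apply(lambda s: s.dropna().reset_index(drop=True))
           .astype('Int64'))
print(out)
```

Each column is compacted independently, so shorter columns end with `<NA>`, matching the desired output.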

Related

Is there any method to append test data with predicted data?

I have one array of test data like array=[[5, 6, 7, 1], [5, 6, 7, 4], [5, 6, 7, 3]] and one array of predicted data like array_pred=[10, 3, 4], both of equal length. Now I want to append the predictions so I get res_array = [[5, 6, 7, 1, 10], [5, 6, 7, 4, 3], [5, 6, 7, 3, 4]]. I don't know what to call it, but I want this kind of result in python. Actually I have to store it in a dataframe and then generate an excel file from that data. Is it possible?
Use numpy.hstack to join the arrays, convert to a Series and then write to Excel:
a = np.hstack((array, np.array(array_pred)[:, None]))
# thank you @Ch3steR
a = np.column_stack([array, array_pred])
print(a)
[[ 5  6  7  1 10]
 [ 5  6  7  4  3]
 [ 5  6  7  3  4]]
s = pd.Series(a.tolist())
print (s)
0 [5, 6, 7, 1, 10]
1 [5, 6, 7, 4, 3]
2 [5, 6, 7, 3, 4]
dtype: object
s.to_excel(file, index=False)
Or, if you need the values flattened into separate columns, convert to DataFrame and Series and use concat:
df = pd.concat([pd.DataFrame(array), pd.Series(array_pred)], axis=1, ignore_index=True)
print(df)
0 1 2 3 4
0 5 6 7 1 10
1 5 6 7 4 3
2 5 6 7 3 4
And then:
df.to_excel(file, index=False)
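Putting both variants together as a self-contained sketch (the to_excel call is commented out because it needs an Excel writer such as openpyxl installed, and `file` is a path you would supply):

```python
import numpy as np
import pandas as pd

array = [[5, 6, 7, 1], [5, 6, 7, 4], [5, 6, 7, 3]]
array_pred = [10, 3, 4]

# Variant 1: append each prediction as a new last column.
a = np.column_stack([array, array_pred])

# Variant 2: flattened DataFrame, one column per value.
df = pd.concat([pd.DataFrame(array), pd.Series(array_pred)],
               axis=1, ignore_index=True)
print(df)
# df.to_excel(file, index=False)  # requires an Excel writer such as openpyxl
```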

How to print each element of a list of lists of variable length as a column in python?

Consider the following list:
L = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
How can I achieve this printing pattern?
1 4 7
2 5 8
3 6 9
More specifically, how can I do it in general for any number of elements in L while assuming that all the nested lists in L have the same length?
Try this:
L = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
for x in zip(*L):
    print(*x)
Output:
1 4 7
2 5 8
3 6 9
L = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
for i in zip(*L):
    print(*i)
Produces:
1 4 7
2 5 8
3 6 9
print(*i) unpacks all of the elements inside i as separate arguments to print, however many there are.
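The title mentions lists of variable length; plain zip stops at the shortest list, so for ragged input itertools.zip_longest can pad the short columns instead (a small sketch, using '.' as an arbitrary fill value):

```python
from itertools import zip_longest

L = [[1, 2, 3], [4, 5], [7, 8, 9]]  # middle list is shorter

# zip_longest pads missing entries with fillvalue instead of truncating.
for row in zip_longest(*L, fillvalue='.'):
    print(*row)
```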

Pandas DataFrame filter by multiple column criterias and multiple intervals

I have checked several answers but found no luck so far.
My dataset is like this:
df = pd.DataFrame({
'Location':['A', 'A', 'A', 'B', 'C', 'C'],
'Place':[1, 2, 3, 4, 2, 3],
'Value1':[1, 1, 2, 3, 4, 5],
'Value2':[1, 1, 2, 3, 4, 5]
}, columns = ['Location','Place','Value1','Value2'])
Location Place Value1 Value2
A 1 1 1
A 2 1 1
A 3 2 2
B 4 3 3
C 2 4 4
C 3 5 5
and I have a list of intervals:
A: [0, 1]
A: [3, 5]
B: [1, 3]
C: [1, 4]
C: [6, 10]
Now I want to keep only the rows whose Place falls inside one of the intervals listed for their Location. So the desired output will be:
Location Place Value1 Value2
A 1 1 1
A 3 2 2
C 2 4 4
C 3 5 5
I know that I can chain multiple between conditions with |, but I have a really long list of intervals, so entering the conditions manually is not feasible. I also considered a for loop to slice the data by location first, but I think there could be a more efficient way.
Thank you for your help.
Edit: Currently the list of intervals is just strings like this
A 0 1
A 3 5
B 1 3
C 1 4
C 6 10
but I would like to slice them into a list of dicts. A better structure for it is also welcome!
First define dataframe df and filters dff:
df = pd.DataFrame({
'Location':['A', 'A', 'A', 'B', 'C', 'C'],
'Place':[1, 2, 3, 4, 2, 3],
'Value1':[1, 1, 2, 3, 4, 5],
'Value2':[1, 1, 2, 3, 4, 5]
}, columns = ['Location','Place','Value1','Value2'])
dff = pd.DataFrame({'Location':['A','A','B','C','C'],
'fPlace':[[0,1], [3, 5], [1, 3], [1, 4], [6, 10]]})
dff[['p1', 'p2']] = pd.DataFrame(dff["fPlace"].to_list())
now dff is:
Location fPlace p1 p2
0 A [0, 1] 0 1
1 A [3, 5] 3 5
2 B [1, 3] 1 3
3 C [1, 4] 1 4
4 C [6, 10] 6 10
where fPlace has been split into the lower and upper bounds p1 and p2, the filters to apply to Place. Next:
df.merge(dff).query('Place >= p1 and Place <= p2').drop(columns = ['fPlace','p1','p2'])
result:
Location Place Value1 Value2
0 A 1 1 1
5 A 3 2 2
7 C 2 4 4
9 C 3 5 5
Prerequisites:
# presumed setup for your intervals:
intervals = {
"A": [
[0, 1],
[3, 5],
],
"B": [
[1, 3],
],
"C": [
[1, 4],
[6, 10],
],
}
Actual solution:
x = df["Location"].map(intervals).explode().str
l, r = x[0], x[1]
res = df["Place"].loc[l.index].between(l, r)
res = res.loc[res].index.unique()
res = df.loc[res]
Outputs:
>>> res
Location Place Value1 Value2
0 A 1 1 1
2 A 3 2 2
4 C 2 4 4
5 C 3 5 5
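Since the edit says the intervals currently arrive as whitespace-separated strings, here is a hedged sketch of parsing them directly into the dff-style filter frame and applying the merge-and-query idea from the first answer (the p1/p2 column names follow that answer):

```python
import pandas as pd

raw = """A 0 1
A 3 5
B 1 3
C 1 4
C 6 10"""

# Parse 'Location p1 p2' lines into a filter frame.
rows = [line.split() for line in raw.splitlines()]
dff = pd.DataFrame(rows, columns=['Location', 'p1', 'p2'])
dff[['p1', 'p2']] = dff[['p1', 'p2']].astype(int)

df = pd.DataFrame({
    'Location': ['A', 'A', 'A', 'B', 'C', 'C'],
    'Place': [1, 2, 3, 4, 2, 3],
    'Value1': [1, 1, 2, 3, 4, 5],
    'Value2': [1, 1, 2, 3, 4, 5]})

# One merged row per (data row, interval) pair; keep rows inside an interval.
res = (df.merge(dff, on='Location')
         .query('Place >= p1 and Place <= p2')
         .drop(columns=['p1', 'p2'])
         .drop_duplicates())  # in case a row matches several intervals
print(res)
```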

Why is pandas df.add_suffix() not working with for-loop

I am trying to use pandas df.add_suffix() for multiple dataframes that are stored in a list, via a for-loop:
df_1 = pd.DataFrame({'X': [2, 3, 4, 5], 'Y': [4, 5, 6, 7]})
df_2 = pd.DataFrame({'X': [6, 7, 8, 9], 'Y': [9, 8, 7, 6]})
df_3 = pd.DataFrame({'X': [6, 3, 1, 13], 'Y': [7, 0, 1, 4]})
mylist = [df_1, df_2, df_3]
for i in mylist:
    i = i.add_suffix('_test')
However, when I print the dataframes afterwards, I still see the old column names "X" and "Y".
When doing the same operation on each of the dataframes separately:
df1 = df_1.add_suffix('_test')
everything works as expected and I get the column names "X_test" and "Y_test".
Does anyone have any idea what I am missing here?
You are changing the value of the variable i, but rebinding i does not affect the mylist elements: the for loop only binds the name i to consecutive elements of mylist. You should use the list index to change the elements:
for i in range(len(mylist)):
    mylist[i] = mylist[i].add_suffix('_test')
The problem is that the output is not assigned back to the list, so nothing changes.
Solution, if you want to assign back to the same list of DataFrames, with enumerate for indexing:
for j, i in enumerate(mylist):
    mylist[j] = i.add_suffix('_test')
print (mylist)
[ X_test Y_test
0 2 4
1 3 5
2 4 6
3 5 7, X_test Y_test
0 6 9
1 7 8
2 8 7
3 9 6, X_test Y_test
0 6 7
1 3 0
2 1 1
3 13 4]
Or if you want a new list of DataFrames, use a list comprehension:
dfs = [i.add_suffix('_test') for i in mylist]
print (dfs)
[ X_test Y_test
0 2 4
1 3 5
2 4 6
3 5 7, X_test Y_test
0 6 9
1 7 8
2 8 7
3 9 6, X_test Y_test
0 6 7
1 3 0
2 1 1
3 13 4]
df_1 = pd.DataFrame({'X': [2, 3, 4, 5], 'Y': [4, 5, 6, 7]})
df_2 = pd.DataFrame({'X': [6, 7, 8, 9], 'Y': [9, 8, 7, 6]})
df_3 = pd.DataFrame({'X': [6, 3, 1, 13], 'Y': [7, 0, 1, 4]})
mylist = [df_1, df_2, df_3]
for i, j in enumerate(mylist):
    mylist[i] = j.add_suffix('_test')
The updated DataFrames are then in the list (mylist) rather than in the original variables.
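The underlying Python behaviour can be seen without pandas at all: rebinding the loop variable never touches the list, while index assignment does (a minimal sketch with plain strings):

```python
mylist = ['X', 'Y']

for i in mylist:
    i = i + '_test'       # rebinds the local name i only
unchanged = list(mylist)  # the list itself is untouched

for j in range(len(mylist)):
    mylist[j] = mylist[j] + '_test'  # index assignment mutates the list
print(unchanged, mylist)
```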

Been trying to build new data frame from existing data frame and series

I'm trying to write a loop that does the following:
df_f.ix[0] = df_n.loc[0]
df_f.ix[1] = h[0]
df_f.ix[2] = df_n.loc[1]
df_f.ix[3] = h[1]
df_f.ix[4] = df_n.loc[2]
df_f.ix[5] = h[2]
...
df_f.ix[94778] = df_n.loc[47389]
df_f.ix[94779] = h[47389]
Basically, the even rows of df_f (0, 2, 4, ...) come from consecutive rows of data frame df_n, and the odd rows (1, 3, 5, ...) come from consecutive elements of the series h. Can anyone help?
You don't necessarily need loops... You can just create a new list of data from your existing data frame/series and then make that into a new DataFrame.
import pandas as pd
#example data
df_n = pd.DataFrame([1,2, 3, 4,5])
h = pd.Series([99, 98, 97, 96, 95])
new_data = [None] * (len(df_n) * 2)
new_data[::2] = df_n.loc[:, 0].values
new_data[1::2] = h.values
new_df = pd.DataFrame(new_data)
In [135]: new_df
Out[135]:
0
0 1
1 99
2 2
3 98
4 3
5 97
6 4
7 96
8 5
9 95
If you really want a loop that will do it you could create an empty data frame like so:
other_df = pd.DataFrame([None] * (len(df_n) * 2))
y = 0
for x in range(len(df_n)):
    other_df.loc[y] = df_n.loc[x]
    y += 1
    other_df.loc[y] = h[x]
    y += 1
In [136]: other_df
Out[136]:
0
0 1
1 99
2 2
3 98
4 3
5 97
6 4
7 96
8 5
9 95
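A pandas-native alternative to the loop (a sketch, not from the original answers): concatenate the frame and the series and stable-sort on the shared index, so each df_n row lands just before the matching h element:

```python
import pandas as pd

df_n = pd.DataFrame([1, 2, 3, 4, 5])
h = pd.Series([99, 98, 97, 96, 95])

# Both objects share index 0..4; a stable sort keeps df_n rows
# ahead of h rows at each index, interleaving the two.
out = (pd.concat([df_n, h.to_frame()])
         .sort_index(kind='stable')
         .reset_index(drop=True))
print(out)
```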
This is easy to do in NumPy. You can retrieve the underlying data from a pandas DataFrame using df.values.
>>> import numpy as np
>>> import pandas as pd
>>> df_a, df_b = pd.DataFrame([1, 2, 3, 4]), pd.DataFrame([5, 6, 7, 8])
>>> df_a
0
0 1
1 2
2 3
3 4
>>> df_b
0
0 5
1 6
2 7
3 8
>>> np_a, np_b = df_a.values, df_b.values
>>> np_a
array([[1],
[2],
[3],
[4]])
>>> np_b
array([[5],
[6],
[7],
[8]])
>>> np_c = np.hstack((np_a, np_b))
>>> np_c
array([[1, 5],
[2, 6],
[3, 7],
[4, 8]])
>>> np_c = np_c.flatten()
>>> np_c
array([1, 5, 2, 6, 3, 7, 4, 8])
>>> df_c = pd.DataFrame(np_c)
>>> df_c
0
0 1
1 5
2 2
3 6
4 3
5 7
6 4
7 8
All of this in one line, given df_a and df_b:
>>> df_c = pd.DataFrame(np.hstack((df_a.values, df_b.values)).flatten())
>>> df_c
0
0 1
1 5
2 2
3 6
4 3
5 7
6 4
7 8
Edit:
If you have more than one column, which is the general case,
>>> df_a = pd.DataFrame([[1, 2], [3, 4]])
>>> df_b = pd.DataFrame([[5, 6], [7, 8]])
>>> df_a
0 1
0 1 2
1 3 4
>>> df_b
0 1
0 5 6
1 7 8
>>> np_a = df_a.values
>>> np_a = np_a.reshape(np_a.shape[0], 1, np_a.shape[1])
>>> np_a
array([[[1, 2]],
[[3, 4]]])
>>> np_b = df_b.values
>>> np_b = np_b.reshape(np_b.shape[0], 1, np_b.shape[1])
>>> np_b
array([[[5, 6]],
[[7, 8]]])
>>> np_c = np.concatenate((np_a, np_b), axis=1)
>>> np_c
array([[[1, 2],
[5, 6]],
[[3, 4],
[7, 8]]])
>>> np_c = np_c.reshape(np_c.shape[0] * np_c.shape[2], np_c.shape[1])
>>> np_c
array([[1, 2],
[5, 6],
[3, 4],
[7, 8]])
>>> df_c = pd.DataFrame(np_c)
