How to get the max from a row's elements in Python?

I have a data frame that contains a single column, Positive Dispatch:

index  Positive Dispatch
0      a,c
1      b
2      a,b
Each keyword has its own value:

a, b, c = 12, 22, 11

I want to create a new column that contains the max of each row. For example, the first row contains a and c, and between them a has the bigger value, which is 12, and so on:

Positive Dispatch  Max
a,c                 12
b                   22
a,b                 22
My attempt:
import pandas as pd

dic1 = {
    'a': [12, 0, 22],
    'b': [0, 13, 22],
    'c': [12, 0, 0],  # there can be N number of columns here, for example
}                     # 'd': [11, 22, 333]
a, b, c = 12, 22, 11  # d would have its own value, for example d = 33
df = pd.DataFrame(dic1)
# Creating the Positive Dispatch column
df['Positive Dispatch'] = df.gt(0).dot(df.columns + ',').str[:-1]
print(df['Positive Dispatch'].max(axis=1))
But this gives the error:

ValueError: No axis named 1 for object type <class 'pandas.core.series.Series'>

(Selecting a single column returns a Series, which is one-dimensional, so there is no axis 1 to take a max over; each keyword has to be mapped back to its value first, as in the answer below.)

IIUC: create a dict, then calculate the max according to the keys and values of that dict using split(), map() and max():

d = {'a': a, 'b': b, 'c': c}
df['Max'] = df['Positive Dispatch'].str.split(',').map(lambda x: max(d.get(y) for y in x))
# for more columns use applymap() in place of map(); the logic stays the same
OR, if you have more columns, like 'Negative Dispatch', then use:

d = {'a': a, 'b': b, 'c': c, 'd': d}
df[['Max', 'Min']] = df[['Positive Dispatch', 'Negative Dispatch']].applymap(
    lambda x: max(d.get(y) for y in x.split(',')))
Sample dataframe used:

dic1 = {
    'a': [12, 0, 22],
    'b': [0, 13, 22],
    'c': [12, 0, 0],  # there can be N number of columns here, for example
    'd': [11, 22, 333]}
a, b, c, d = 12, 22, 11, 33
df = pd.DataFrame(dic1)
df['Positive Dispatch'] = df.gt(0).dot(df.columns + ',').str[:-1]
df['Negative Dispatch'] = [['a,d'], ['c,b,a'], ['d,c']]
df['Negative Dispatch'] = df['Negative Dispatch'].str.join(',')
Output:

    a   b   c Positive Dispatch  Max
0  12   0  12               a,c   12
1   0  13   0                 b   22
2  22  22   0               a,b   22
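As an aside, a fully vectorized alternative (a sketch, assuming the same keyword-to-value dict d as above) is to explode the split keywords, map each one to its value, and take the max per original row:

import pandas as pd

d = {'a': 12, 'b': 22, 'c': 11}
df = pd.DataFrame({'Positive Dispatch': ['a,c', 'b', 'a,b']})
# explode() keeps the original row index, so a group-wise max
# over that index yields one value per original row
df['Max'] = (df['Positive Dispatch'].str.split(',')
             .explode()
             .map(d)
             .groupby(level=0)
             .max())
print(df)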

Related

How to put the print information into a current dataframe

I am having a brain fart. I wrote some code to get keywords from my data frame. It worked, but how can I put the printed information into my current data frame? Thank you for the help in advance.
from scipy.sparse import coo_matrix

def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    # use only topn items from vector
    sorted_items = sorted_items[:topn]
    score_vals = []
    feature_vals = []
    # word index and corresponding tf-idf score
    for idx, score in sorted_items:
        # keep track of feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])
    # create a dict of feature -> score
    # results = zip(feature_vals, score_vals)
    results = {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]] = score_vals[idx]
    return results

# tf_idf_vector, feature_names and doc are defined earlier (not shown)
# sort the tf-idf vectors by descending order of scores
sorted_items = sort_coo(tf_idf_vector.tocoo())
# extract only the top n; n here is 5
keywords = extract_topn_from_vector(feature_names, sorted_items, 5)
# now print the results - NEED TO PUT THIS INFORMATION IN MY CURRENT DATAFRAME
print("\nAbstract:")
print(doc)
print("\nKeywords:")
for k in keywords:
    print(k, keywords[k])
First: a DataFrame is not Excel, so it may not look the way you expect.
You can use append() to add a new row with text. It automatically fills missing columns with NaN if the row is shorter, or adds new columns filled with NaN if the row is longer.
import pandas as pd

data = {
    'X': ['A', 'B', 'C'],
    'Y': ['D', 'E', 'F'],
    'Z': ['G', 'H', 'I']
}
df = pd.DataFrame(data)
print(df)

df = df.append({"X": 'Abstract:'}, ignore_index=True)
df = df.append({"X": 'Keywords:'}, ignore_index=True)

keywords = {"first": 123, "second": 456, "third": 789}
for key, value in keywords.items():
    df = df.append({"X": key, "Y": value}, ignore_index=True)

print(df)
Result:

# Before
   X  Y  Z
0  A  D  G
1  B  E  H
2  C  F  I

# After
           X    Y    Z
0          A    D    G
1          B    E    H
2          C    F    I
3  Abstract:  NaN  NaN
4  Keywords:  NaN  NaN
5      first  123  NaN
6     second  456  NaN
7      third  789  NaN
Later you can replace NaN with something else, e.g. an empty string:

df = df.fillna('')
Result:

           X    Y  Z
0          A    D  G
1          B    E  H
2          C    F  I
3  Abstract:
4  Keywords:
5      first  123
6     second  456
7      third  789
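Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On current versions the same result can be built with pd.concat; a minimal sketch of the same example:

import pandas as pd

df = pd.DataFrame({'X': ['A', 'B', 'C'],
                   'Y': ['D', 'E', 'F'],
                   'Z': ['G', 'H', 'I']})
keywords = {"first": 123, "second": 456, "third": 789}

# collect the new rows first, then concatenate once;
# missing columns are filled with NaN automatically
rows = [{"X": "Abstract:"}, {"X": "Keywords:"}]
rows += [{"X": k, "Y": v} for k, v in keywords.items()]
df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)
print(df.fillna(''))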

Check if numbers are sequential according to another column?

I have a data frame that looks like this:
Numbers  Names
0        A
1        A
2        B
3        B
4        C
5        C
6        C
8        D
10       D
My numbers (integers) need to be sequential IF the value in the "Names" column is the same for both rows. For example, between 6 and 8 the numbers are not sequential, but that is fine since "Names" changes from C to D. Between 8 and 10, however, there is a problem, since both rows have the same "Names" value but the numbers are not sequential.
I would like code that returns the missing numbers that need to be added according to the logic explained above.
import itertools as it
import pandas as pd

df = pd.read_excel("booki.xlsx")
c1 = df['Numbers'].copy()
c2 = df['Names'].copy()
# note: the two chained ranges are identical, so the loop body runs twice per index
for i in it.chain(range(1, len(c2) - 1), range(1, len(c1) - 1)):
    b = c2[i]
    c = c2[i + 1]
    x = c1[i]
    n = c1[i + 1]
    if c == b and n - x > 1:
        print(x + 1)
It prints the missing numbers, but twice each, so for the data frame in the example it prints:

9
9

but I would like it to print only:

9

Perhaps it's some failure in the logic?
Thank you
You can use groupby('Names') and then shift() to get the differences between consecutive elements within each group, then pick only the rows whose difference is not -1 and print the number that follows them.
Try this:
import pandas as pd
import numpy as np
from io import StringIO

df = pd.read_csv(StringIO("""
Numbers Names
0 A
1 A
2 B
3 B
4 C
5 C
6 C
8 D
10 D"""), sep=r"\s+")

differences = (df.groupby('Names', as_index=False)
                 .apply(lambda g: g['Numbers'] - g['Numbers'].shift(-1))
                 .fillna(-1)
                 .reset_index())
missing_numbers = (df[differences != -1]['Numbers'].dropna() + 1).tolist()
print(missing_numbers)
Output:

[9.0]

(The value comes back as a float because the boolean masking introduces NaNs, which upcast the column.)
I'm not sure itertools is needed here. Here is a solution using only pandas methods:

Group the data according to the Names column using groupby
Select the min and max of the Numbers column
Define an integer range from min to max
merge this range with the sub-dataframe
Filter the rows with missing values using isna
Return the filtered df
Optional: reindex the columns for prettier output with reset_index

Here is the code:
import pandas as pd

df = pd.DataFrame({"Numbers": [0, 1, 2, 3, 4, 5, 6, 8, 10, 15],
                   "Names": ["A", "A", "B", "B", "C", "C", "C", "D", "D", "D"]})

def select_missing(df):
    # Select min and max values
    min_ = df.Numbers.min()
    max_ = df.Numbers.max()
    # Create integer range
    serie = pd.DataFrame({"Numbers": [i for i in range(min_, max_ + 1)]})
    # Merge with df
    m = serie.merge(df, on=['Numbers'], how='left')
    # Return rows not matching the equality
    return m[m.isna().any(axis=1)]

# Group the data per Names and apply the "select_missing" function
out = df.groupby("Names").apply(select_missing)
print(out)
#          Numbers Names
# Names
# D     1        9   NaN
#       3       11   NaN
#       4       12   NaN
#       5       13   NaN
#       6       14   NaN

out = out[["Numbers"]].reset_index(level=0)
print(out)

#   Names  Numbers
# 1     D        9
# 3     D       11
# 4     D       12
# 5     D       13
# 6     D       14
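For comparison, a compact set-based sketch (assuming the same sample data) that returns the missing numbers per group directly:

missing = df.groupby('Names')['Numbers'].apply(
    lambda s: sorted(set(range(s.min(), s.max() + 1)) - set(s)))
print(missing)
# Names
# A                     []
# B                     []
# C                     []
# D    [9, 11, 12, 13, 14]
# Name: Numbers, dtype: object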

Take the difference of all elements of a series with the previous ones in python pandas

I have a dataframe with sorted values labeled by ids, and I want to take the difference between the value of the first element of an id and the values of the last elements of all previous ids. The code below does what I want:
import pandas as pd

a = 'a'; b = 'b'; c = 'c'
df = pd.DataFrame(data=[*zip([a, a, a, b, b, c, a], [1, 2, 3, 5, 6, 7, 8])],
                  columns=['id', 'value'])
print(df)

# # take the last value for a particular id
# last_value_for_id = df.loc[df.id.shift(-1) != df.id, :]
# print(last_value_for_id)

current_id = ''; prev_values = {}; diffs = {}
for t in df.itertuples(index=False):
    prev_values[t.id] = t.value
    if current_id != t.id:
        current_id = t.id
    else:
        continue
    for k, v in prev_values.items():
        if k == current_id:
            continue
        diffs[(k, current_id)] = t.value - v
print(pd.DataFrame(data=diffs.values(), columns=['diff'], index=diffs.keys()))
prints:

  id  value
0  a      1
1  a      2
2  a      3
3  b      5
4  b      6
5  c      7
6  a      8

     diff
a b     2
  c     4
b c     1
  a     2
c a     1
I want to do this in a vectorized manner, however. I have found a way of getting the series of last elements:

# take the last value for a particular id
last_value_for_id = df.loc[df.id.shift(-1) != df.id, :]
print(last_value_for_id)

which gives me:

  id  value
2  a      3
4  b      6
5  c      7

but I can't find a way of using this to take the diffs in a vectorized manner.
Depending on how many ids you have, this works with a few thousand:

import numpy as np
import pandas as pd

# enumerate ids; should be careful to preserve the order of first appearance
ids = [a, b, c]
num_ids = len(ids)
# compute first and last value per id
f = df.groupby('id').value.agg(['first', 'last'])
# lower-triangle mask (True on and below the diagonal)
mask = np.array([[i >= j for j in range(num_ids)] for i in range(num_ids)])
# compute diff of first and last, then mask
diff = np.where(mask, None, f['first'].to_numpy()[None, :] - f['last'].to_numpy()[:, None])
diff = pd.DataFrame(diff,
                    index=ids,
                    columns=ids)
# stack into a Series indexed by (earlier id, later id)
diff.stack()
output:

a  b    2
   c    4
b  c    1
dtype: object
Edit for updated data:
For the updated data the approach is similar, once we can create the f table:

# create blocks of consecutive ids
blocks = df['id'].ne(df['id'].shift()).cumsum()
# groupby
groups = df.groupby(blocks)
# create first and last values per block
df['fv'] = groups.value.transform('first')
df['lv'] = groups.value.transform('last')
# the above f and ids
# note the column name change ('fv'/'lv' instead of 'first'/'last')
f = df[['id', 'fv', 'lv']].drop_duplicates()
ids = f['id'].values
num_ids = len(ids)
Output:

a  b    2
   c    4
   a    5
b  c    1
   a    2
c  a    1
dtype: object
If you want to go further and drop the (a, a) index, well, I'm too lazy :D.
My method:

s = df.groupby(df.id.shift().ne(df.id).cumsum()).agg({'id': 'first', 'value': ['min', 'max']})
s.columns = s.columns.droplevel(0)
t = s['min'].values[:, None] - s['max'].values
t = t.astype(float)

Below is all reshaping, to match your output:

t[np.triu_indices(t.shape[1], 0)] = np.nan
newdf = pd.DataFrame(t, index=s['first'], columns=s['first'])
newdf.values[newdf.index.values[:, None] == newdf.index.values] = np.nan
newdf = newdf.T.stack()
newdf
Out[933]:
first  first
a      b        2.0
       c        4.0
b      c        1.0
       a        2.0
c      a        1.0
dtype: float64

Count frequency of values in pandas DataFrame column

I want to count the number of times each value appears in a dataframe.
Here is my dataframe - df:
    status
1        N
2        N
3        C
4        N
5        S
6        N
7        N
8        S
9        N
10       N
11       N
12       S
13       N
14       C
15       N
16       N
17       N
18       N
19       S
20       N
I want a dictionary of counts, e.g. counts = {'N': 14, 'C': 2, 'S': 4}.
I have tried df['status']['N'] but it gives a KeyError, and also df['status'].value_counts, but to no avail.
You can use value_counts and to_dict:

print(df['status'].value_counts())

N    14
S     4
C     2
Name: status, dtype: int64

counts = df['status'].value_counts().to_dict()
print(counts)

{'S': 4, 'C': 2, 'N': 14}
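If you want relative frequencies rather than raw counts, value_counts also accepts normalize=True:

print(df['status'].value_counts(normalize=True).to_dict())

{'N': 0.7, 'S': 0.2, 'C': 0.1}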
An alternative one-liner using the underdog Counter:

In [3]: from collections import Counter
In [4]: dict(Counter(df.status))
Out[4]: {'C': 2, 'N': 14, 'S': 4}
You can try it this way:

df.stack().value_counts().to_dict()
Can you convert df into a list?
If so:

a = ['a', 'a', 'a', 'b', 'b', 'c']
c = dict()
for i in set(a):
    c[i] = a.count(i)

Using a dict comprehension:

c = {i: a.count(i) for i in set(a)}
See my response in this thread for a pandas DataFrame output:
count the frequency that a value occurs in a dataframe column
For dictionary output, you can modify it as follows:

def column_list_dict(x):
    column_list_df = []
    for col_name in x.columns:
        y = col_name, len(x[col_name].unique())
        column_list_df.append(y)
    return dict(column_list_df)
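Note that len(x[col_name].unique()) counts the number of distinct values per column rather than how often each value occurs. If a per-column frequency dictionary is what you need, a sketch using value_counts:

# frequency of every value, keyed by column name
freqs = {col: df[col].value_counts().to_dict() for col in df.columns}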

pandas dataframe groupby like mysql, yet into new column

df = pd.DataFrame({'A': [11, 11, 22, 22], 'mask': [0, 0, 0, 1], 'values': np.arange(10, 30, 5)})
df

    A  mask  values
0  11     0      10
1  11     0      15
2  22     0      20
3  22     1      25
Now how can I group by A, keep the column names intact, and put the result of a custom function into Z:

def calculate_df_stats(dfs):
    mask_ = list(dfs['B'])
    mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
    return mean

df['Z'] = df.groupby('A').agg(calculate_df_stats)  # does not work
and generate:

    A  mask  values     Z
0  11     0      10  12.5
1  22     0      20    25
Whatever I do, it only replaces the values column with the masked mean.
And can your solution be applied to a function of two columns that returns the result in a new column?
Thanks!
Edit:
To clarify more: let's say I have such a table in MySQL:

SELECT * FROM `Reader_datapoint` WHERE `wavelength` = '560'
LIMIT 200;

which gives me this result:
http://pastebin.com/qXiaWcJq

If I now run this:

SELECT *, avg(action_value) FROM `Reader_datapoint` WHERE `wavelength` = '560'
GROUP BY `reader_plate_ID`;

I get:

datapoint_ID  plate_ID  coordinate_x  coordinate_y  res_value  wavelength  ignore  avg(action_value)
193           1         0             0             2.1783     560         NULL    2.090027083333334
481           2         0             0             1.7544     560         NULL    1.4695583333333333
769           3         0             0             2.0161     560         NULL    1.6637885416666673

How can I replicate this behaviour in pandas? Note that all the column names stay the same, the first value is taken, and the new column is added.
If you want the original columns in your result, you can first calculate the grouped and aggregated dataframe (but you will have to aggregate your original columns in some way; I took the first occurrence as an example):
>>> df = pd.DataFrame({'A':[11,11,22,22],'mask':[0,0,0,1],'values':np.arange(10,30,5)})
>>>
>>> grouped = df.groupby("A")
>>>
>>> result = grouped.agg('first')
>>> result
    mask  values
A
11     0      10
22     0      20
and then add a column 'Z' to that result by applying your function to the groupby object 'grouped':
>>> def calculate_df_stats(dfs):
...     mask_ = list(dfs['mask'])
...     mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
...     return mean
...
>>> result['Z'] = grouped.apply(calculate_df_stats)
>>>
>>> result
    mask  values     Z
A
11     0      10  12.5
22     0      20  20.0
In your function definition you can always use more columns (just by their name) to compute the result.
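For this particular masked mean there is also a fully vectorized sketch (no per-group Python function): blank out the masked rows with where, then take an ordinary group mean, since mean() skips NaN:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [11, 11, 22, 22],
                   'mask': [0, 0, 0, 1],
                   'values': np.arange(10, 30, 5)})
result = df.groupby('A').first()
# rows with mask == 1 become NaN and are ignored by mean()
result['Z'] = df['values'].where(df['mask'] == 0).groupby(df['A']).mean()
print(result)
#     mask  values     Z
# A
# 11     0      10  12.5
# 22     0      20  20.0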
