I have the following table.
| product | check | check1 | type | amount |
|---------|-------|--------|------|--------|
| A | 1 | a | c | -10 |
| A | 1 | a | p | 20 |
| B | 2 | b | c | 20 |
| B | 2 | b | p | 20 |
| C | 3 | c | c | -10 |
| D | 4 | d | p | 15 |
| D | 4 | d | c | -15 |
I want to sum the amounts of rows where the first three columns are equal, one row has type 'c' and the other has type 'p', the 'c' row's amount is negative, and the 'p' row's amount is positive; otherwise the rows should not be summed. When rows are summed, the resulting type should be 'c' if the summed amount is negative and 'p' otherwise. See the required output below:
| product | check | check1 | type | amount |
|---------|-------|--------|------|--------|
| A | 1 | a | p | 10 |
| B | 2 | b | c | 20 |
| B | 2 | b | p | 20 |
| C | 3 | c | c | -10 |
| D | 4 | d | p | 0 |
I have tried groupby on the first three columns and then applying a lambda function:
df = df.groupby(['product', 'check', 'check1']).apply(lambda x, y : x + y, x.loc[(x['type']=='c')], y.loc[(y['type']=='p')], 'amount')
This gives a NameError saying 'x' is not defined. I am also not sure if this is the right way to go, so if you have any tips, please let me know!
Here is a solution; it may not be efficient, but it works!
new_df = pd.DataFrame()
for product in df['product'].unique():
    for check in df[df['product'] == product].check.unique():
        for check1 in df[(df['product'] == product) & (df.check == check)].check1.unique():
            tmp = df[(df['product'] == product) & (df.check == check) & (df.check1 == check1)]
            if len(tmp[((tmp.type == 'c') & (tmp.amount < 0)) | ((tmp.type == 'p') & (tmp.amount > 0))]) != 2:
                # Group does not contain a valid c/p pair: keep its rows unchanged.
                new_df = pd.concat([new_df, tmp], ignore_index=True)
            else:
                # Valid pair: collapse the group into a single summed row.
                amount = tmp['amount'].sum()
                row_type = 'c' if amount < 0 else 'p'
                elt = {
                    'product': product,
                    'check': check,
                    'check1': check1,
                    'type': row_type,
                    'amount': amount
                }
                new_df = pd.concat([new_df, pd.DataFrame([elt])], ignore_index=True)
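If the loops become a bottleneck on larger frames, a vectorized version of the same logic should produce the same result. This is only a sketch, assuming the exact column names from the question:

import numpy as np
import pandas as pd

keys = ['product', 'check', 'check1']
# A row is valid when its type/sign combination matches the rule above.
valid_row = ((df['type'].eq('c') & df['amount'].lt(0)) |
             (df['type'].eq('p') & df['amount'].gt(0)))
# A group qualifies when exactly two of its rows form a valid c/p pair.
qualifies = valid_row.groupby([df[k] for k in keys]).transform('sum').eq(2)

# Collapse qualifying groups to one summed row; pass the rest through untouched.
summed = df[qualifies].groupby(keys, as_index=False)['amount'].sum()
summed['type'] = np.where(summed['amount'] < 0, 'c', 'p')

new_df = pd.concat([df[~qualifies], summed], ignore_index=True)
new_df = new_df[['product', 'check', 'check1', 'type', 'amount']]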
I have a CSV with data that looks like this:
| id | code | date                      |
|----|------|---------------------------|
| 1  | 2    | 2022-10-05 07:22:39+00:00 |
| 1  | 0    | 2022-11-05 02:22:35+00:00 |
| 2  | 3    | 2021-01-05 10:10:15+00:00 |
| 2  | 0    | 2019-01-11 10:05:21+00:00 |
| 2  | 1    | 2022-01-11 10:05:22+00:00 |
| 3  | 2    | 2022-10-10 11:23:43+00:00 |
I want to remove duplicate ids based on the following conditions:
For the code column, choose a value that is not equal to 0; among those, choose the one with the latest timestamp.
Add another column, prev_code, which contains a list of all the remaining code values for that id (those not kept in the code column).
Something like this:
| id | code | prev_code |
|----|------|-----------|
| 1  | 2    | [0]       |
| 2  | 1    | [0, 3]    |
| 3  | 2    | []        |
There is probably a sleeker solution, but something along the following lines should work:
df = pd.read_csv('file.csv')
df['date'] = pd.to_datetime(df['date'])  # compare as timestamps rather than strings

# For each id, among the non-zero codes, pick the one with the latest date.
lastcode = df[df.code != 0].groupby('id').apply(lambda block: block[block['date'] == block['date'].max()]['code'])

# For each id, collect every code that differs from the one kept above.
prev_codes = df.groupby('id').agg(code=('code', lambda x: [val for val in x if val != lastcode[x.name].values[0]]))['code']

pd.DataFrame({'id': map(lambda x: x[0], lastcode.index.values), 'code': lastcode.values, 'prev_code': prev_codes.values})
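For comparison, a possibly sleeker variant of the same idea, sketched under the assumption that the dates parse with pd.to_datetime: sort the rows so the one to keep comes first within each id, then split the kept code from the rest.

import pandas as pd

df = pd.read_csv('file.csv')
df['date'] = pd.to_datetime(df['date'])

# Non-zero codes sort before zero codes, latest date first, so the row to
# keep is the first row of each id.
ranked = (df.assign(is_zero=df['code'].eq(0))
            .sort_values(['id', 'is_zero', 'date'],
                         ascending=[True, True, False]))
kept = ranked.drop_duplicates('id')[['id', 'code']]

# Every code per id that differs from the kept one goes into prev_code.
prev = (ranked.merge(kept, on='id', suffixes=('', '_kept'))
              .query('code != code_kept')
              .groupby('id')['code'].agg(list)
              .rename('prev_code'))

result = kept.merge(prev.reset_index(), on='id', how='left')
result['prev_code'] = result['prev_code'].apply(
    lambda v: v if isinstance(v, list) else [])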
Can anyone help me sort the order of last page viewed?
I have a dataframe that I am attempting to sort by the previous page viewed, and I am having a really hard time coming up with an efficient method using Pandas.
For example from this:
+------------+------------------+----------+
| Customer | previousPagePath | pagePath |
+------------+------------------+----------+
| 1051471580 | A | D |
| 1051471580 | C | B |
| 1051471580 | A | exit |
| 1051471580 | B | A |
| 1051471580 | D | A |
| 1051471580 | entrance | C |
+------------+------------------+----------+
To this:
+------------+------------------+----------+
| Customer | previousPagePath | pagePath |
+------------+------------------+----------+
| 1051471580 | entrance | C |
| 1051471580 | C | B |
| 1051471580 | B | A |
| 1051471580 | A | D |
| 1051471580 | D | A |
| 1051471580 | A | exit |
+------------+------------------+----------+
However, it could be millions of rows long for thousands of different customers, so I really need to think about how to make this efficient.
pd.DataFrame({
    'Customer': '1051471580',
    'previousPagePath': ['E', 'C', 'B', 'A', 'D', 'A'],
    'pagePath': ['C', 'B', 'A', 'D', 'A', 'F']
})
Thanks!
What you're trying to do is topological sorting, which can be achieved with networkx. Note that I had to change some values in your dataframe to prevent it from throwing a cycle error, so I hope the data you work on contains unique values:
import networkx as nx
import pandas as pd

data = [[1051471580, "Z", "D"], [1051471580, "C", "B"], [1051471580, "A", "exit"], [1051471580, "B", "Z"], [1051471580, "D", "A"], [1051471580, "entrance", "C"]]
df = pd.DataFrame(data, columns=['Customer', 'previousPagePath', 'pagePath'])

# Build a directed graph from the page transitions and sort it topologically.
edges = df[df.pagePath != df.previousPagePath].reset_index()
dg = nx.from_pandas_edgelist(edges, source='previousPagePath', target='pagePath', create_using=nx.DiGraph())
order = list(nx.lexicographical_topological_sort(dg))

# Reorder the rows to follow that order; the final node ('exit') has no outgoing row.
result = df.set_index('previousPagePath').loc[order[:-1], :].dropna().reset_index()
result = result[['Customer', 'previousPagePath', 'pagePath']]
Output:
| | Customer | previousPagePath | pagePath |
|---:|-----------:|:-------------------|:-----------|
| 0 | 1051471580 | entrance | C |
| 1 | 1051471580 | C | B |
| 2 | 1051471580 | B | Z |
| 3 | 1051471580 | Z | D |
| 4 | 1051471580 | D | A |
| 5 | 1051471580 | A | exit |
You can sort your DataFrame by a column like this:
df = pd.DataFrame({'Customer':'1051471580','previousPagePath':['E','C','B','A','D','A'], 'pagePath':['C','B','A','D','A','F']})
df.sort_values(by='previousPagePath')
You can find the documentation here: pandas.DataFrame.sort_values.
I have a multiindexed dataframe where the index levels have multiple categories, something like this:
| Level1 | Level2 | Level3 | Var1 | Var2 | Var3 |
|--------|--------|--------|------|------|------|
| A      | A      | A      |      |      |      |
| A      | A      | B      |      |      |      |
| A      | B      | A      |      |      |      |
| A      | B      | B      |      |      |      |
| B      | A      | A      |      |      |      |
| B      | A      | B      |      |      |      |
| B      | B      | A      |      |      |      |
| B      | B      | B      |      |      |      |
In summary, and specifically in my case, Level1 has 2 values, Level2 has 24, Level3 has 6, Level4 has 674, and Level5 has 9 (with some minor variation depending on the specific higher-level values: Level1 == 1 actually has 24 Level2s, but Level1 == 2 has 23).
I need to generate all possible combinations of 3 at Level 5, then calculate their means for Vars 1-3.
I am trying something like this:
import itertools

# Resulting df to be populated
df_result = pd.DataFrame([])
# Retrieving values at Level1
lev1s = df.index.get_level_values("Level1").unique()
# Looping through each Level1 value
for lev1 in lev1s:
    # Filtering df based on Level1 value
    df_lev1 = df.query('Level1 == ' + str(lev1))
    # Repeating...
    lev2s = df_lev1.index.get_level_values("Level2").unique()
    for lev2 in lev2s:
        df_lev2 = df_lev1.query('Level2 == ' + str(lev2))
        # ... until Level3
        lev3s = df_lev2.index.get_level_values("Level3").unique()
        # Creating all combinations
        combs = itertools.combinations(lev3s, 3)
        # Looping through each combination
        for comb in combs:
            # Filtering values in combination
            df_comb = df_lev2.query('Level3 in ' + str(comb))
            # Calculating means using groupby (groupby might not be necessary,
            # but I don't believe it has much of an impact)
            df_means = df_comb.reset_index().groupby(['Level1', 'Level2']).mean()
            # Extending resulting dataframe
            df_result = pd.concat([df_result, df_means])
The thing is, after a little while this process gets really slow. Since I have around 2 * 24 * 6 * 674 level combinations and 84 combinations per group (choosing 3 out of 9 elements), I am expecting more than 16 million df_means frames to be calculated.
Is there any more efficient way to do this?
Thank you.
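Edit: one direction I am considering (an untested sketch, which assumes the desired mean is the row-count-weighted mean over the combined rows) is to precompute per-Level3 sums and row counts once, then assemble each combination's mean from those, instead of re-filtering the DataFrame for every combination:

import itertools
import pandas as pd

# Precompute per-(Level1, Level2, Level3) sums and row counts once.
sums = df.groupby(['Level1', 'Level2', 'Level3'])[['Var1', 'Var2', 'Var3']].sum()
counts = df.groupby(['Level1', 'Level2', 'Level3']).size()

rows = []
for (lev1, lev2), block in sums.groupby(level=['Level1', 'Level2']):
    lev3s = block.index.get_level_values('Level3')
    for comb in itertools.combinations(lev3s, 3):
        idx = [(lev1, lev2, l3) for l3 in comb]
        # Mean of the combined rows = combined sums / combined row count.
        means = sums.loc[idx].sum() / counts.loc[idx].sum()
        rows.append({'Level1': lev1, 'Level2': lev2, 'comb': comb,
                     **means.to_dict()})
df_result = pd.DataFrame(rows)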
Let's say I have a map written in ASCII. This map represents the gardens of some people. I have to write a program that, given the map, returns how many trees there are in each garden. A mockup map:
+-----------+------------------------------------+
| | B A |
| A A A A | A (Jennifer) |
| | |
| C +--------------+---------------------+
| B C | |
| B C | B B |
| B C | (Marta) |
| B | |
+--------------+ | |
| | | |
| (Peter) B | | A |
| C | (Maria) | |
| A | | |
+--------------+ +---------------------+
| | | |
| | | |
| | | |
| (Elsa) + | (Joe) |
| / | C |
| C A / A + C A A |
| B / A B \ A B |
| B A / C \ B |
+---------+----+---------- +--+------------------+
The output should be something like:
Jennifer B:1 A:1 C:0
Marta B:2 A:1 C:0
Peter A:1 B:1 C:1
...
Joe A:3 B:2 C:2
Is there any package in python or any algorithm that I can study to understand how to perform this task?
I would start by creating a matrix of chars from that ASCII map.
Then I would find all the (i,j) positions of the '(' characters.
Having found those indexes, you can create a dictionary with each person's name as key and another dictionary as value. Each of the inner dictionaries has a tree name as key and an integer count as value.
Knowing the (i,j) of each '(', I would read the names and initialize the dictionary.
Now, for each (i,j) pointing at a '(', do the following (let name be the name related to that '('):
check if there is a letter to the left of '('; if you find a letter x, increment dict[name][x] (stop if you find any of '|', '\', '/', '+');
do the same for the right;
start going upwards, checking left and right on each line;
do the same while going downwards.
You just have to play with it a bit to work out how to correctly recognize the wall between Maria and Jennifer. A rough Python sketch of the overall idea follows below.
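Here is that sketch. It swaps the four directional scans for a flood fill over each enclosed region (count_trees is a hypothetical helper; it assumes the walls '+', '-', '|', '/', '\' form closed 4-connected boundaries and that the names contain no bare uppercase tree letters):

from collections import deque

WALLS = set('+-|/\\')
TREES = set('ABC')

def count_trees(ascii_map):
    grid = [list(line) for line in ascii_map.splitlines() if line.strip()]
    result = {}
    for i, row in enumerate(grid):
        for j, ch in enumerate(row):
            if ch != '(':
                continue
            # Read the owner's name between '(' and ')'.
            k = j + 1
            name = ''
            while grid[i][k] != ')':
                name += grid[i][k]
                k += 1
            # Flood-fill the region this '(' sits in, counting trees.
            counts = dict.fromkeys(TREES, 0)
            seen = {(i, j)}
            queue = deque([(i, j)])
            while queue:
                y, x = queue.popleft()
                if grid[y][x] in TREES:
                    counts[grid[y][x]] += 1
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < len(grid) and 0 <= nx < len(grid[ny])
                            and (ny, nx) not in seen
                            and grid[ny][nx] not in WALLS):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            result[name] = counts
    return result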
Here is an example in JavaScript that should be fairly easy to translate to Python. It scans left to right, relying on a determination of the current labeled area. Hopefully the function names and comments make clear what is happening. If you have questions, please ask.
function f(map){
  const m = map.split('\n')
    .filter(x => x)
    .map(x => x.trim());
  const [h, w] = [m.length, m[0].length];
  const borders = ['+', '-', '|', '/', '\\'];
  const trees = ['A', 'B', 'C'];
  const labelToName = {};
  const result = {};
  let prevLabels = new Array(w).fill(0);
  let currLabels = new Array(w).fill(0);
  let label = 1;

  function getLabel(y, x){
    // A label is the same as
    // a non-border to the left,
    // above, or northeast.
    if (!borders.includes(m[y][x-1]))
      return currLabels[x-1];
    else if (!borders.includes(m[y-1][x]))
      return prevLabels[x];
    else if (!borders.includes(m[y-1][x+1]))
      return prevLabels[x+1];
    else
      return label++;
  }

  function update(label, tree){
    if (!result[label])
      result[label] = {[tree]: 1};
    else if (result[label][tree])
      result[label][tree]++;
    else
      result[label][tree] = 1;
  }

  for (let y=1; y<h-1; y++){
    for (let x=1; x<w-1; x++){
      const tile = m[y][x];
      if (borders.includes(tile))
        continue;
      const currLabel = getLabel(y, x);
      currLabels[x] = currLabel;
      if (tile == '('){
        let name = '';
        while (m[y][++x] != ')'){
          name += m[y][x];
          currLabels[x] = currLabel;
        }
        currLabels[x] = currLabel;
        labelToName[currLabel] = name;
      } else if (trees.includes(tile)){
        update(currLabel, tile);
      }
    }
    prevLabels = currLabels;
    currLabels = new Array(w).fill(0);
  }
  return [result, labelToName];
}
var map = `
+-----------+------------------------------------+
| | B A |
| A A A A | A (Jennifer) |
| | |
| C +--------------+---------------------+
| B C | |
| B C | B B |
| B C | (Marta) |
| B | |
+--------------+ | |
| | | |
| (Peter) B | | A |
| C | (Maria) | |
| A | | |
+--------------+ +---------------------+
| | | |
| | | |
| | | |
| (Elsa) + | (Joe) |
| / | C |
| C A / A + C A A |
| B / A B \\ A B |
| B A / C \\ B |
+---------+----+---------- +--+------------------+
`;
var [result, labelToName] = f(map);
for (let label in result)
  console.log(`${ labelToName[label] }: ${ JSON.stringify(result[label]) }`)
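For reference, a rough Python translation of the same scanline labeling approach. It keeps the logic and assumptions of the JavaScript above (rectangular map, single-character walls) and is a sketch rather than tested code:

def f(ascii_map):
    m = [line.strip() for line in ascii_map.split('\n') if line]
    h, w = len(m), len(m[0])
    borders = set('+-|/\\')
    trees = set('ABC')
    label_to_name = {}
    result = {}
    prev_labels = [0] * w
    curr_labels = [0] * w
    label = 1

    def get_label(y, x):
        # Same label as a non-border neighbour to the left, above, or northeast.
        nonlocal label
        if m[y][x - 1] not in borders:
            return curr_labels[x - 1]
        if m[y - 1][x] not in borders:
            return prev_labels[x]
        if m[y - 1][x + 1] not in borders:
            return prev_labels[x + 1]
        label += 1
        return label - 1

    for y in range(1, h - 1):
        x = 1
        while x < w - 1:
            tile = m[y][x]
            if tile not in borders:
                curr = get_label(y, x)
                curr_labels[x] = curr
                if tile == '(':
                    # Read the name between '(' and ')'.
                    name = ''
                    x += 1
                    while m[y][x] != ')':
                        name += m[y][x]
                        curr_labels[x] = curr
                        x += 1
                    curr_labels[x] = curr
                    label_to_name[curr] = name
                elif tile in trees:
                    result.setdefault(curr, {})
                    result[curr][tile] = result[curr].get(tile, 0) + 1
            x += 1
        prev_labels = curr_labels
        curr_labels = [0] * w
    return result, label_to_name

result, label_to_name = f(map_string)  # map_string: hypothetical variable holding the ASCII map above
for label, counts in result.items():
    print(label_to_name.get(label), counts)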
Suppose I have a simple pandas dataframe df like so:
|    | name  | car |
|----|-------|-----|
| 0 | 'bob' | 'b' |
| 1 | 'bob' | 'c' |
| 2 | 'fox' | 'b' |
| 3 | 'fox' | 'c' |
| 4 | 'cox' | 'b' |
| 5 | 'cox' | 'c' |
| 6 | 'jo' | 'b' |
| 7 | 'jo' | 'c' |
| 8 | 'bob' | 'b' |
| 9 | 'bob' | 'c' |
| 10 | 'bob' | 'b' |
| 11 | 'bob' | 'c' |
| 12 | 'rob' | 'b' |
| 13 | 'rob' | 'c' |
I would like to find the row indices of a specific pattern that spans both columns. In my real application the above dataframe has a few thousand rows, and I have a few thousand dataframes, so performance is not important. The pattern, say, that I am interested in is:
| 'bob' | 'b' |
| 'bob' | 'c' |
Hence, using the above example, my desired output would be:
out_idx = [0,1,8,9,10,11]
Typically of course, for one pattern, one would do something like df.loc[(df.name == 'bob') & (df.car == 'b')], but I am not sure how to do it when I am looking for a specific, multivariate pattern over multiple columns. I.e. I am looking for something like the following (and I am pretty sure it is not correct): df.loc[(df.name == 'bob') & (df.car == 'b') & (df.car == 'c')].
Help much appreciated. Thx!
Use boolean indexing with Series.isin instead of the second and third conditions:
df1 = df[(df.name == 'bob') & df.car.isin(['b','c'])]
print (df1)
name car
0 bob b
1 bob c
8 bob b
9 bob c
10 bob b
11 bob c
If you need the index values:
out_idx = df.index[(df.name == 'bob') & df.car.isin(['b','c'])]
Or:
out_idx = df[(df.name == 'bob') & df.car.isin(['b','c'])].index
Your solution is possible with | (bitwise OR) instead of the second &, with one added pair of parentheses:
df1 = df[(df.name == 'bob') & ((df.car == 'b') | (df.car == 'c'))]
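Note that isin treats the rows independently, which happens to match the sample data here. If the 'b' and 'c' rows must also be consecutive, a shifted comparison is one possible sketch:

# Mark rows that start the two-row pattern, then also include their successors.
first = (df['name'].eq('bob') & df['car'].eq('b') &
         df['name'].shift(-1).eq('bob') & df['car'].shift(-1).eq('c'))
out_idx = df.index[first | first.shift(fill_value=False)].tolist()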