python dataframe unique values

I don't have experience with dataframes and I'm stuck on the following problem.
There is a table that looks like this:
parent account account number account name code
0 parent 1 123122 account1 1
1 parent 1 456222 account2 1
2 parent 1 456334 account3 1
3 parent 2 456446 account4 1
4 parent 2 456558 account5 2
5 parent 2 456670 account6 3
6 parent 2 456782 account7 1
7 parent 2 456894 account8 1
8 parent 2 457006 account9 1
9 parent 2 457118 account10 1
10 parent 2 457230 account11 1
11 parent 2 457342 account12 1
12 parent 2 457454 account13 1
13 parent 2 457566 account14 1
14 parent 3 457678 account15 1
15 parent 3 457790 account16 1
16 parent 4 457902 account17 5
17 parent 4 458014 account18 5
18 parent 4 458126 account19 5
19 parent 4 458238 account20 5
20 parent 4 458350 account21 1
I need to check which parents have only one version of code (last column) and which have more.
The desired output is a table like the sample, but with every parent that has only one version of code excluded.
import pandas as pd

# read by default 1st sheet of an excel file
dataframe1 = pd.read_excel("./input/dane.xlsx")
parents = dataframe1.groupby(["parent account", "code"])

This is the only output I've got at the moment; it's something, but it is not the result I need:

for i in parents["parent account"]:
    print(list(i)[0])

('parent 1', 1)
('parent 2', 1)
('parent 2', 2)
('parent 2', 3)
('parent 3', 1)
('parent 4', 1)
('parent 4', 5)
Could you please help me with that?

First, obtain a list of parent accounts that have more than one distinct code:

condition = df.groupby('parent account').code.nunique() > 1
parent_list = list(condition.index[condition.values])

Then apply the filter to your data:

df[df['parent account'].isin(parent_list)]
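The same result can be had in one pass with groupby().filter(), which drops whole groups that fail a predicate. A minimal sketch on a toy frame (the toy data is made up; only the column names match the question):

```python
import pandas as pd

# toy frame mirroring the question's columns
df = pd.DataFrame({
    'parent account': ['parent 1', 'parent 1', 'parent 2', 'parent 2', 'parent 3'],
    'code': [1, 1, 1, 2, 1],
})

# keep only the groups whose 'code' column has more than one distinct value
result = df.groupby('parent account').filter(lambda g: g['code'].nunique() > 1)
print(result)
```

This avoids building the intermediate list of qualifying parents.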

Related

Calculate a np.arange within a Panda dataframe from other columns

I want to create a new column with all the coordinates the car needs to pass to reach a certain goal, stored as a list in a pandas column.
To start with I have this:
import numpy as np
import pandas as pd

cars = pd.DataFrame({'x_now': np.repeat(1, 5),
                     'y_now': np.arange(5, 0, -1),
                     'x_1_goal': np.repeat(1, 5),
                     'y_1_goal': np.repeat(10, 5)})
output would be:
x_now y_now x_1_goal y_1_goal
0 1 5 1 10
1 1 4 1 10
2 1 3 1 10
3 1 2 1 10
4 1 1 1 10
I have tried to add new columns like this, but it does not work:
for xy_index in range(len(cars)):
    if cars.at[xy_index, 'x_now'] == cars.at[xy_index, 'x_1_goal']:
        cars.at[xy_index, 'x_car_move_route'] = np.repeat(
            cars.at[xy_index, 'x_now'].astype(int),
            abs(cars.at[xy_index, 'y_now'].astype(int) - cars.at[xy_index, 'y_1_goal'].astype(int)))
    else:
        cars.at[xy_index, 'x_car_move_route'] = \
            np.arange(cars.at[xy_index, 'x_now'], cars.at[xy_index, 'x_1_goal'],
                      (cars.at[xy_index, 'x_1_goal'] - cars.at[xy_index, 'x_now']) /
                      abs(cars.at[xy_index, 'x_1_goal'] - cars.at[xy_index, 'x_now']))
In the end I want the columns x_car_move_route and y_car_move_route so I can loop over the coordinates they need to pass. I will show it with tkinter. I will also add more goals, since this is actually only the first turn they need to make.
x_now y_now x_1_goal y_1_goal x_car_move_route y_car_move_route
0 1 5 1 10 [1,1,1,1,1] [6,7,8,9,10]
1 1 4 1 10 [1,1,1,1,1,1] [5,6,7,8,9,10]
2 1 3 1 10 [1,1,1,1,1,1,1] [4,5,6,7,8,9,10]
3 1 2 1 10 [1,1,1,1,1,1,1,1] [3,4,5,6,7,8,9,10]
4 1 1 1 10 [1,1,1,1,1,1,1,1,1] [2,3,4,5,6,7,8,9,10]
You can apply() something like this route() function along axis=1, which means route() will receive rows from cars. It generates either x or y coordinates depending on what's passed into var (from args).
You can tweak/fix as needed, but it should get you started:
def route(row, var):
    var2 = 'y' if var == 'x' else 'x'
    now, now2 = row[f'{var}_now'], row[f'{var2}_now']
    goal, goal2 = row[f'{var}_1_goal'], row[f'{var2}_1_goal']
    diff, diff2 = goal - now, goal2 - now2
    if diff == 0:
        result = np.array([now] * abs(diff2)).astype(int)
    else:
        result = 1 + np.arange(now, goal, diff / abs(diff)).astype(int)
    return result

cars['x_car_move_route'] = cars.apply(route, args=('x',), axis=1)
cars['y_car_move_route'] = cars.apply(route, args=('y',), axis=1)
x_now y_now x_1_goal y_1_goal x_car_move_route y_car_move_route
0 1 5 1 10 [1,1,1,1,1] [6,7,8,9,10]
1 1 4 1 10 [1,1,1,1,1,1] [5,6,7,8,9,10]
2 1 3 1 10 [1,1,1,1,1,1,1] [4,5,6,7,8,9,10]
3 1 2 1 10 [1,1,1,1,1,1,1,1] [3,4,5,6,7,8,9,10]
4 1 1 1 10 [1,1,1,1,1,1,1,1,1] [2,3,4,5,6,7,8,9,10]
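Since each leg of the route only changes one coordinate, a plain list comprehension may be easier to follow than apply(). A sketch, assuming (as in the sample) that the goal coordinate is never smaller than the current one:

```python
import numpy as np
import pandas as pd

cars = pd.DataFrame({'x_now': np.repeat(1, 5),
                     'y_now': np.arange(5, 0, -1),
                     'x_1_goal': np.repeat(1, 5),
                     'y_1_goal': np.repeat(10, 5)})

# y route: every integer step from y_now + 1 up to and including y_1_goal
cars['y_car_move_route'] = [list(range(y + 1, g + 1))
                            for y, g in zip(cars['y_now'], cars['y_1_goal'])]
# x route: x does not change on this turn, so repeat x_now once per y step
cars['x_car_move_route'] = [[x] * len(r)
                            for x, r in zip(cars['x_now'], cars['y_car_move_route'])]
print(cars.loc[0, 'y_car_move_route'])  # [6, 7, 8, 9, 10]
```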

Recursive tree search to get the node level

I have a dataset that contains a tree similar to the tree below.
son father
1 1 NA
2 2 1
3 3 1
4 4 2
5 5 NA
6 6 2
7 7 4
8 8 5
9 9 4
I built a function that allows me to search the entire hierarchy of a node (son):

getTree = function(sons){
  if (length(sons) > 0) {
    sons = subset(df, father %in% sons)[['son']]
    sons = c(sons, getTree(sons))
  }
  return(sons)
}

subset(df, son %in% getTree(8))

That returns:
son father
4 4 2
6 6 2
7 7 4
9 9 4
However, in addition to the hierarchy, it is necessary to know at which level of the tree that node (child) is. How do I change, or create another function that allows me to achieve this?
Thanks in advance!
I'm not sure exactly what your function is meant to find in the tree, but here's an example in Python that finds the deepest child nodes in the table along with the depth. It uses an incremented counter on each call to keep track of the depth:
In [140]: def traverse(sons, depth=0):
     ...:     next_sons = sons[sons['father'].isin(sons['son'])]
     ...:     if len(next_sons) > 0:
     ...:         return traverse(next_sons, depth+1)
     ...:     return sons, depth
In [141]: traverse(df)
Out[141]:
( son father
7 7 4.0
9 9 4.0,
3)
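If you want the level of every node rather than just the deepest ones, an iterative frontier walk keeps the bookkeeping simple. A sketch in Python on the same son/father table (the levels() helper is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'son': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'father': [None, 1, 1, 2, None, 2, 4, 5, 4]})

def levels(df, root):
    """Return {node: level} for root and all of its descendants."""
    out = {root: 0}
    frontier = [root]
    while frontier:
        # all rows whose father sits on the current frontier
        children = [int(s) for s in df.loc[df['father'].isin(frontier), 'son']]
        for c in children:
            father = df.loc[df['son'] == c, 'father'].iloc[0]
            out[c] = out[father] + 1
        frontier = children
    return out

print(levels(df, 2))  # {2: 0, 4: 1, 6: 1, 7: 2, 9: 2}
```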
Here is one recursive option in R to keep track of the node level using a data.frame:
f <- function(sons) {
  getTree <- function(s.df) {
    repeat {
      sons <- subset(
        df,
        father %in% s.df$sons[s.df$lvl == max(s.df$lvl)]
      )[["son"]]
      if (length(sons) == 0) {
        return(s.df)
      }
      p <- data.frame(sons = sons, lvl = max(s.df$lvl) + 1)
      s.df <- rbind(s.df, getTree(p))
    }
  }
  getTree(data.frame(sons = sons, lvl = 0))
}
where the levels always start from 0 for the input argument sons to function f, such that
> f(1)
sons lvl
1 1 0
2 2 1
3 3 1
4 4 2
5 6 2
6 7 3
7 9 3
> f(2)
sons lvl
1 2 0
2 4 1
3 6 1
4 7 2
5 9 2
> f(5)
sons lvl
1 5 0
2 8 1

How can I improve the performance to collect the information I need from this Pandas dataframe?

First time here and a beginner in pandas, so I'll try to be as clear as possible.
I have a data set that contains a column "Name" with child and parent rows.
The parent row gives me a start and a stop value, and with that I know which children are associated with this parent.
My data
In[64]: df
Out[64]:
Name Start Stop Id
0 child 2 4 x
1 child 5 6 x
2 child 7 8 x
3 parent 1 10 x
4 child 12 15 y
5 child 15 16 y
6 child 16 19 y
7 child 20 22 y
8 child 23 24 y
9 parent 11 25 y
10 child 27 28 z
11 child 29 34 z
12 parent 26 35 z
What I want is a dataframe for each parent that contains all of its child rows.
The child start and stop values must fall within the parent's range, and the Id needs to match as well.
UPDATE: Multiple parents can have the same Id.
I have a working strategy that goes like this:
Build a dataframe containing all the parent rows.
Iterate through all the rows of this new dataframe.
For each row, check the start, stop and id, and test every row of the source dataframe for a match.
Append the matches to a new dataframe and insert it into a list of dataframes.
The code looks like this:
import pandas as pd

data = {'Name': ['child','child','child','parent','child','child','child','child','child','parent','child','child','parent'],
        'Start': [2,5,7,1,12,15,16,20,23,11,27,29,26],
        'Stop': [4,6,8,10,15,16,19,22,24,25,28,34,35],
        'Id': ['x','x','x','x','y','y','y','y','y','y','z','z','z']}
df = pd.DataFrame(data)
dfParent = df[df['Name'].str.contains('parent', regex=False)]

dfList = []  # list holding one dataframe of children per parent
for index, row in dfParent.iterrows():
    # Select all children of this parent: inside its range and matching its Id
    dfTemp = df.loc[(df['Start'] > row['Start']) & (df['Stop'] < row['Stop']) & (df['Id'] == row['Id'])]
    dfList.append(dfTemp)
dfList
dfList
Out[61]:
[ Name Start Stop Id
0 child 2 4 x
1 child 5 6 x
2 child 7 8 x,
Name Start Stop Id
4 child 12 15 y
5 child 15 16 y
6 child 16 19 y
7 child 20 22 y
8 child 23 24 y,
Name Start Stop Id
10 child 27 28 z
11 child 29 34 z]
The result is OK, but the performance is terrible when I use my real data set (~500,000 rows).
So my question is: do you have any tips on how I can start improving this code?
Thanks!
Assuming that each Id only has one parent:
dfList = []
for k, d in df.groupby('Id'):
    start, stop = d.loc[d["Name"] == 'parent', ['Start', 'Stop']].iloc[0]
    dfList.append(d[d["Name"].eq('child') & d["Start"].ge(start) & d["Stop"].le(stop)])
Or you can do a merge and query:
(df[df["Name"].eq('parent')]
 .merge(df[df["Name"].eq('child')], on='Id', suffixes=['_p', '_c'])
 .query('Start_p <= Start_c <= Stop_c <= Stop_p')
)
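As a quick end-to-end check, the merge-and-query variant can be run on a trimmed-down copy of the sample data (the chained comparison in query() keeps only child intervals nested inside their parent's):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['child', 'child', 'parent', 'child', 'parent'],
    'Start': [2, 5, 1, 12, 11],
    'Stop': [4, 6, 10, 15, 25],
    'Id': ['x', 'x', 'x', 'y', 'y'],
})

# pair every parent with every child that shares its Id, then keep only
# the pairs where the child interval lies inside the parent interval
pairs = (df[df['Name'].eq('parent')]
         .merge(df[df['Name'].eq('child')], on='Id', suffixes=['_p', '_c'])
         .query('Start_p <= Start_c <= Stop_c <= Stop_p'))
print(len(pairs))  # 3: two x children and one y child survive
```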

Do python recursive calls interfere with each other?

I am trying to set up a recursive game solver (for the Cracker Barrel peg game). The recursive function appears not to be operating correctly, and some outputs are created with no trace of how they were created (despite logging all steps). Is it possible that the Python recursion steps are overwriting each other?
I have already tried adding print statements at every step. The game rules and algorithms work correctly, but the recursive play algorithm is not operating as expected.
def recursive_play(board, moves_list, move_history, id, first_trial, recurse_counter):
    # Check how many moves are left
    tacks_left = len(char_locations(board, character=tack, grid=True))
    log_and_print(f"tacks_left: {tacks_left}")
    log_and_print(f"moves_left: {len(moves_list)}")
    log_and_print(f"moves_list: {moves_list}")
    if (len(moves_list) == 0):
        if (tacks_left == 1):
            # TODO: Remove final move separator
            log_and_print(f"ONE TACK LEFT :)!!!!")
            log_and_print(f"move_history to retrun for win: {move_history}")
            return move_history
        pass
    elif (len(moves_list) > 0):
        # Scan through all moves and make them recursively
        for move in moves_list:
            if first_trial:
                id += 1
            else:
                # id += 1
                id = id
            next_board = make_move(board, move)
            next_moves = possible_moves(next_board)
            if first_trial:
                next_history = "START: " + move
            else:
                next_history = move_history + round_separator + move
            # log_and_print(f"og_board:")
            prettify_board(board)
            log_and_print(f"move: {move}")
            log_and_print(f"next_board:")
            prettify_board(next_board)
            # log_and_print(f"next_moves: {next_moves}")
            log_and_print(f"next_history: {next_history}")
            log_and_print(f"id: {id}")
            log_and_print(f"recurse_counter: {recurse_counter}")
            # NOTE: Would this be cleaner with queues?
            recursive_play(next_board, moves_list=next_moves, move_history=next_history, id=id, first_trial=False, recurse_counter=recurse_counter+1)
    log_and_print(f"finished scanning all moves for board: {board}")
I expect all steps to be logged, and "START" should only occur on the first trial. However, a mysterious "START" appears in a later step with no trace of how that board was created.
Good Output:
INFO:root:next_history: START: 4 to 2 to 1 , 6 to 5 to 4 , 1 to 3 to 6 , 7 to 4 to 2
INFO:root:id: 1
INFO:root:recurse_counter: 3
INFO:root:tacks_left: 5
INFO:root:moves_left: 2
INFO:root:moves_list: ['9 to 8 to 7', '10 to 6 to 3']
INFO:root:o---
INFO:root:xo--
INFO:root:oox-
INFO:root:xoox
INFO:root:move: 9 to 8 to 7
INFO:root:next_board:
INFO:root:o---
INFO:root:xo--
INFO:root:oox-
INFO:root:xoox
INFO:root:next_history: START: 4 to 2 to 1 , 6 to 5 to 4 , 1 to 3 to 6 , 7 to 4 to 2 , 9 to 8 to 7
INFO:root:id: 1
INFO:root:recurse_counter: 4
INFO:root:tacks_left: 4
INFO:root:moves_left: 1
INFO:root:moves_list: ['10 to 6 to 3']
INFO:root:o---
INFO:root:xx--
INFO:root:ooo-
INFO:root:xooo
INFO:root:move: 10 to 6 to 3
INFO:root:next_board:
INFO:root:o---
INFO:root:xx--
INFO:root:ooo-
INFO:root:xooo
INFO:root:next_history: START: 4 to 2 to 1 , 6 to 5 to 4 , 1 to 3 to 6 , 7 to 4 to 2 , 9 to 8 to 7 , 10 to 6 to 3
Bad Output:
INFO:root:move: 6 to 3 to 1
INFO:root:next_board:
INFO:root:x---
INFO:root:xo--
INFO:root:ooo-
INFO:root:oooo
INFO:root:next_history: START: 6 to 3 to 1
INFO:root:id: 2
INFO:root:recurse_counter: 0
INFO:root:tacks_left: 2
INFO:root:moves_left: 1
INFO:root:moves_list: ['1 to 2 to 4']
INFO:root:o---
INFO:root:oo--
INFO:root:xoo-
INFO:root:oooo
INFO:root:move: 1 to 2 to 4
INFO:root:next_board:
INFO:root:o---
INFO:root:oo--
INFO:root:xoo-
INFO:root:oooo
INFO:root:next_history: START: 6 to 3 to 1 , 1 to 2 to 4
INFO:root:id: 2
INFO:root:recurse_counter: 1
INFO:root:tacks_left: 1
INFO:root:moves_left: 0
INFO:root:moves_list: []
INFO:root:ONE TACK LEFT :)!!!!
INFO:root:move_history to retrun for win: START: 6 to 3 to 1 , 1 to 2 to 4
INFO:root:finished scanning all moves for board: ['o---', 'oo--', 'xoo-', 'oooo']
Any tips anyone can provide would be greatly appreciated.
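For what it's worth, Python recursive calls cannot interfere with each other: every call gets its own frame with its own local variables, so sibling branches never overwrite one another's state. A "START" appearing late in the log is usually just the top-level for loop moving on to its next first move after a deep branch has finished; note also that the code above discards the value returned by the recursive recursive_play call, so a winning move_history found deep in the tree never propagates back up. A minimal sketch of both points (the explore() helper is made up for illustration):

```python
def explore(path, depth):
    # each call owns its 'path' and its own loop state; the sibling
    # branches below cannot clobber one another's locals
    if depth == 2:
        return [path]
    results = []
    for move in ('a', 'b'):
        # the recursive result must be collected and returned --
        # dropping it would silently lose everything found below
        results += explore(path + [move], depth + 1)
    return results

print(explore([], 0))  # [['a', 'a'], ['a', 'b'], ['b', 'a'], ['b', 'b']]
```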

Count various entrys in a DataFrame

I want to find out how many different devices are in this list.
Is this SQL statement sufficient, or do I have to do more?
Unfortunately, with such a large amount of data I do not know which method is right, or whether my solution is correct.
Some devices occur more than once; that is, the number of lines is not equal to the number of devices.
Suggestions in Python or SQL are welcome.
import pandas as pd
from sqlalchemy import create_engine # database connection
from IPython.display import display
disk_engine = create_engine('sqlite:///gender-train-devices.db')
phones = pd.read_sql_query('SELECT device_id, COUNT(device_id) FROM phone_brand_device_model GROUP BY [device_id]', disk_engine)
print(phones)
the output is:
device_id COUNT(device_id)
0 -9223321966609553846 1
1 -9223067244542181226 1
2 -9223042152723782980 1
3 -9222956879900151005 1
4 -9222896629442493034 1
5 -9222894989445037972 1
6 -9222894319703307262 1
7 -9222754701995937853 1
8 -9222661944218806987 1
9 -9222399302879214035 1
10 -9222352239947207574 1
11 -9222173362545970626 1
12 -9221825537663503111 1
13 -9221768839350705746 1
14 -9221767098072603291 1
15 -9221674814957667064 1
16 -9221639938103564513 1
17 -9221554258551357785 1
18 -9221307795397202665 1
19 -9221086586254644858 1
20 -9221079146476055829 1
21 -9221066489596332354 1
22 -9221046405740900422 1
23 -9221026417907250887 1
24 -9221015678978880842 1
25 -9220961720447724253 1
26 -9220830859283101130 1
27 -9220733369151052329 1
28 -9220727250496861488 1
29 -9220452176650064280 1
... ... ...
186686 9219686542557325817 1
186687 9219842210460037807 1
186688 9219926280825642237 1
186689 9219937375310355234 1
186690 9219958455132520777 1
186691 9220025918063413114 1
186692 9220160557900894171 1
186693 9220562120895859549 1
186694 9220807070557263555 1
186695 9220814716773471568 1
186696 9220880169487906579 1
186697 9220914901466458680 1
186698 9221114774124234731 1
186699 9221149157342105139 1
186700 9221152396628736959 1
186701 9221297143137682579 1
186702 9221586026451102237 1
186703 9221608286127666096 1
186704 9221693095468078153 1
186705 9221768426357971629 1
186706 9221843411551060582 1
186707 9222110179000857683 1
186708 9222172248989688166 1
186709 9222214407720961524 1
186710 9222355582733155698 1
186711 9222539910510672930 1
186712 9222779211060772275 1
186713 9222784289318287993 1
186714 9222849349208140841 1
186715 9223069070668353002 1
[186716 rows x 2 columns]
If you want the number of different devices, you can just query the database:
SELECT COUNT(distinct device_id)
FROM phone_brand_device_model ;
Of course, if you already have the data in a data frame for some other purpose you can count the number of rows there.
If you already have data in memory as a dataframe, you can use:
df['device_id'].nunique()
otherwise use Gordon's solution - it should be faster
If you want to do it in pandas, you can do something like:
len(phones.device_id.unique())
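Both pandas variants can be checked on a toy frame (the device ids here are made up):

```python
import pandas as pd

phones = pd.DataFrame({'device_id': ['a', 'b', 'a', 'c', 'b', 'a']})

# nunique() counts distinct values directly
print(phones['device_id'].nunique())      # 3
# equivalent, via the array of unique values
print(len(phones['device_id'].unique()))  # 3
```

Note that nunique() excludes NaN by default, while unique() includes it in its result.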
