I have a dataset that looks like
Node Target Group
A B 1
A C 1
B D 2
F E 3
F A 3
G M 3
I would like to create a graph for each distinct value in Group. There are 5 groups in total.
file_num = 1
for item <=5: # this is wrong
    item.plot(title='Iteration n.'+item)
    plt.savefig(f'{file}_{file_num}.png')
    file_num += 1
G = nx.from_pandas_edgelist(df, source='Node', target='Target')
cytoscapeobj = ipycytoscape.CytoscapeWidget()
cytoscapeobj.graph.add_graph_from_networkx(G)
cytoscapeobj
This code, however, does not generate the individual graphs (G and the object from Cytoscape), which means something is not working within the loop (at least).
Any help would be extremely useful.
Let's call the DataFrame df. Then you iterate over the unique values in the Group column, subset df by this Group value, and save a new figure each iteration of the loop.
import matplotlib.pyplot as plt
import networkx as nx
import ipycytoscape

file_num = 1
for group_val in df.Group.unique():
    df_group = df[df['Group'] == group_val]
    df_group.plot(title=f'Iteration n.{group_val}')
    plt.savefig(f'group_{file_num}.png')  # pick any filename pattern you like
    file_num += 1
    G = nx.from_pandas_edgelist(df_group, source='Node', target='Target')
    cytoscapeobj = ipycytoscape.CytoscapeWidget()
    cytoscapeobj.graph.add_graph_from_networkx(G)
    cytoscapeobj
You should use the display() function on cytoscapeobj:
for group_val in df.Group.unique():
    <...>
    display(cytoscapeobj)
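If you want each saved figure to show the network itself rather than a plot of the DataFrame, a minimal sketch (not part of the original answers; the file names are just placeholders) could draw each group's graph with networkx before saving:
import matplotlib.pyplot as plt
import networkx as nx

for group_val in df.Group.unique():
    df_group = df[df['Group'] == group_val]
    G = nx.from_pandas_edgelist(df_group, source='Node', target='Target')
    plt.figure()
    nx.draw(G, with_labels=True)           # draw nodes and edges with node labels
    plt.title(f'Group {group_val}')
    plt.savefig(f'group_{group_val}.png')  # one file per group
    plt.close()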
Generate an example dataframe
import random
import string
import numpy as np
import pandas as pd

df = pd.DataFrame(
    columns=[random.choice(string.ascii_uppercase) for i in range(5)],
    data=np.random.rand(10, 5))
df
V O C X E
0 0.060255 0.341051 0.288854 0.740567 0.236282
1 0.933778 0.393021 0.547383 0.469255 0.053089
2 0.994518 0.156547 0.917894 0.070152 0.201373
3 0.077694 0.685540 0.865004 0.830740 0.605135
4 0.760294 0.838441 0.905885 0.146982 0.157439
5 0.116676 0.340967 0.400340 0.293894 0.220995
6 0.632182 0.663218 0.479900 0.931314 0.003180
7 0.726736 0.276703 0.057806 0.624106 0.719631
8 0.677492 0.200079 0.374410 0.962232 0.915361
9 0.061653 0.984166 0.959516 0.261374 0.361677
Now I want to filter a dataframe using the values in the first column, but since I make heavy use of chaining (e.g. df.T.replace(0, np.nan).pipe(np.log2).mean(axis=1).fillna(0).pipe(func)) I need a much more compact notation for the operation. Normally you'd do something like
df[df.iloc[:, 0] < 0.5]
V O C X E
0 0.060255 0.341051 0.288854 0.740567 0.236282
3 0.077694 0.685540 0.865004 0.830740 0.605135
5 0.116676 0.340967 0.400340 0.293894 0.220995
9 0.061653 0.984166 0.959516 0.261374 0.361677
but the awkwardly redundant syntax is horrible for chaining. I want to replace it with a .query(), and normally you'd use the column name like df.query('V < 0.5'), but here I want to be able to query the table by column index number instead of by name. So in the example, I've deliberately randomized the column names. I also cannot use the table name in the query, like df.query('@df[0] < 0.5'), since in a long chain the intermediate result has no name.
I'm hoping there is some syntax such as df.query('_[0] < 0.5') where I can refer to the source table as some symbol _.
You can use f-string notation in df.query:
df.query(f'{df.columns[0]} < .5')
Output (note the column names differ from the example above, since the random frame was regenerated):
J M O R N
3 0.114554 0.131948 0.650307 0.672486 0.688872
4 0.272368 0.745900 0.544068 0.504299 0.434122
6 0.418988 0.023691 0.450398 0.488476 0.787383
7 0.040440 0.220282 0.263902 0.660016 0.955950
Update: using the "walrus" operator in Python 3.8+
Let's try this:
((dfout := df.T.replace(0, np.nan).pipe(np.log2).mean(axis=1).fillna(0).to_frame(name='values'))
 .query(f'{dfout.columns[0]} > -2'))
output:
values
N -1.356779
O -1.202353
M -1.591623
T -1.557801
You can use a lambda function in .loc, which is passed the dataframe; you can then use .iloc inside it for your positional indexing. So you could do:
df.loc[lambda x: x.iloc[:, 0] > 0.5]
This should work in a method chain.
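For example, here is a self-contained sketch (random data, illustrative threshold and steps) showing how the positional filter slots into a chain without ever naming the intermediate result:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.rand(10, 5))      # column names don't matter here

result = (df
          .replace(0, np.nan)                  # illustrative earlier chain steps
          .loc[lambda x: x.iloc[:, 0] < 0.5]   # filter on the first column by position
          .fillna(0))
print(result)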
For a single column with index:
df.query(f"{df.columns[0]}<0.5")
V O C X E
0 0.060255 0.341051 0.288854 0.740567 0.236282
3 0.077694 0.685540 0.865004 0.830740 0.605135
5 0.116676 0.340967 0.400340 0.293894 0.220995
9 0.061653 0.984166 0.959516 0.261374 0.361677
For multiple columns with index:
idx = [0,1]
col = df.columns[np.r_[idx]]
val = 0.5
query = ' and '.join([f"{i} < {val}" for i in col])
# V < 0.5 and O < 0.5
print(df.query(query))
V O C X E
0 0.060255 0.341051 0.288854 0.740567 0.236282
5 0.116676 0.340967 0.400340 0.293894 0.220995
I have a dataframe like below
df
           A          B          C
0          0          1  TRANSIT_1
1  TRANSIT_3       None       None
2          0  TRANSIT_5       None
And I want to change it to below:
Resulting DF
           A          B          C          D
0          0          1  TRANSIT_1  TRANSIT_1
1  TRANSIT_3       None       None  TRANSIT_3
2          0  TRANSIT_5       None  TRANSIT_5
So I tried to use str.contains, and once I receive the series of True/False values, I put it in an eval function to somehow get the table I want.
Code I tried:
series_index = pd.DataFrame()
series_index = df.columns.str.contains("^TRANSIT_", case=True, regex=True)
print(type(series_index))
series_index.index[series_index].tolist()
I thought of using the eval function to write it to a separate column, like
df = eval(df[result]=the index) # I dont know, But eval function does evaluation and puts it in a separate column
I couldn't find a simple one-liner, but this works:
idx = list(df1[df1.where(df1.applymap(lambda x: 'TRA' in x if isinstance(x, str) else False)).notnull()].stack().index)
a, b = [], []
for sublist in idx:
    a.append(sublist[0])
    b.append(sublist[1])
df1['ans'] = df1.lookup(a, b)
Output
A B C ans
0 0 1 TRANSIT_1 TRANSIT_1
1 TRANSIT_3 None None TRANSIT_3
2 0 TRANSIT_5 None TRANSIT_5
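Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so on a recent pandas you will need an alternative. A minimal sketch, assuming the goal is simply to pull the first TRANSIT_ value out of each row (data reconstructed from the question):
import pandas as pd

df1 = pd.DataFrame({'A': [0, 'TRANSIT_3', 0],
                    'B': [1, None, 'TRANSIT_5'],
                    'C': ['TRANSIT_1', None, None]})

def first_transit(row):
    # Return the first string in the row starting with "TRANSIT_", else None.
    for value in row:
        if isinstance(value, str) and value.startswith('TRANSIT_'):
            return value
    return None

df1['ans'] = df1.apply(first_transit, axis=1)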
Say I have a dataframe with a column x.
I want to make a new column x_new, but I want the first row of this new column to be set to a specific number (let's say -2).
Then, from the 2nd row on, use the previous row's value in the cx function:
data = {'x':[1,2,3,4,5]}
df = pd.DataFrame(data)

def cx(x):
    if df.loc[1,'x_new'] == 0:
        df.loc[1,'x_new'] = -2
    else:
        x_new = -10*x + 2
        return x_new

df['x_new'] = (cx(df['x']))
The final dataframe
I am not sure how to do this.
Thank you for your help
This is what I have so far:
data = {'depth':[1,2,3,4,5]}
df = pd.DataFrame(data)
df

# calculate equation
def depth_cal(d):
    z = -3*d + 1  # d must be previous row
    return z

depth_cal = (depth_cal(df['depth']))  # how to set d as previous row
print(depth_cal)

depth_new = []
for row in df['depth']:
    if row == 1:
        depth_new.append('-5.63')
    else:
        depth_new.append(depth_cal)  # does not put list in a column

df['Depth_correct'] = depth_new
correct output:
There are still two problems with this:
1. It does not put the depth_cal list properly in a column.
2. In the depth_cal function, I want d to be the previous row.
Thank you
I would do this by just using a loop to generate your new data; it might not be ideal if the data is particularly huge, but it's a quick operation. Let me know how you get on with this:
data = {'depth':[1,2,3,4,5]}
df = pd.DataFrame(data)

res = list(data['depth'])  # copy, so the original list is not mutated in place
res[0] = -5.63
for i in range(1, len(res)):
    res[i] = -3 * res[i-1] + 1

df['new_depth'] = res
print(df)
To get
depth new_depth
0 1 -5.63
1 2 17.89
2 3 -52.67
3 4 159.01
4 5 -476.03
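If you'd rather avoid the explicit index loop, the same recurrence can be written with itertools.accumulate (the initial argument needs Python 3.8+); this is just an alternative sketch, not part of the original answer:
from itertools import accumulate
import pandas as pd

df = pd.DataFrame({'depth': [1, 2, 3, 4, 5]})
# Seed with -5.63, then apply x_next = -3 * x_prev + 1 once per remaining row.
df['new_depth'] = list(accumulate(range(len(df) - 1),
                                  lambda prev, _: -3 * prev + 1,
                                  initial=-5.63))
print(df)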
I have a huge table file that looks like the following. In order to work on individual products (name), I tried to use pandas groupby, but it seems to put the whole table (~10G) in memory, which I cannot afford.
name index change
A Q QQ
A Q QQ
A Q QQ
B L LL
C Q QQ
C L LL
C LL LL
C Q QQ
C L LL
C LL LL
The name column is well sorted and I will only care about one name at a time. I hope to use the following criteria on the "change" column to filter each name:
Check whether the number of "QQ" rows overwhelms the number of "LL" rows. Basically, if the number of rows containing "QQ" minus the number of rows containing "LL" is >= 2, then discard/ignore the "LL" rows for this name from now on. If "LL" overwhelms "QQ", then discard the "QQ" rows. (E.g. A has 3 QQ and 0 LL, and C has 4 LL and 2 QQ. They both are fine.)
Resulting table:
name index change
A Q QQ
A Q QQ
A Q QQ
C L LL
C LL LL
C L LL
C LL LL
Comparing "change" to "index", if no change occurs (e.g. LL in both columns), the row is not valid. Further, for the valid changes, the remaining QQ or LL has to be continuous for >=3 times. Therefore C only has 2 valid changes, and it will be filtered out.
Resulting table:
name index change
A Q QQ
A Q QQ
A Q QQ
I wonder if there is a way to just work on the table name by name and release memory after each name. (And not have to apply the two criteria step by step.) Any hint or suggestion will be appreciated!
Because the file is sorted by "name", you can read the file row-by-row:
def process_name(name, data, output_file):
    group_by = {}
    for index, change in data:
        if index not in group_by:
            group_by[index] = []
        group_by[index].append(change)
    # do the step 1 filter logic here
    # do the step 2 filter logic here
    for index in group_by:
        if index in group_by[index]:
            # Because there is at least one "no change" this
            # whole "name" can be thrown out, so return here.
            return
    for index in group_by:
        output_file.write("%s\t%s\t%s\n" % (name, index, group_by[index]))

current_name = None
current_data = []

input_file = open(input_filename, "r")
output_file = open(output_filename, "w")
header = input_file.readline()

for row in input_file:
    cols = row.strip().split("\t")
    name = cols[0]
    index = cols[1]
    change = cols[2]
    if name != current_name:
        if current_name is not None:
            process_name(current_name, current_data, output_file)
        current_name = name
        current_data = []
    current_data.append((index, change))

# process what's left in the buffer
if current_name is not None:
    process_name(current_name, current_data, output_file)

input_file.close()
output_file.close()
I don't totally understand the logic you've explained in #1, so I left that blank. I also feel like you probably want to do step #2 first as that will quickly rule out entire "name"s.
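For reference, here is one possible reading of the two criteria as pandas operations on a single name's sub-table; whether this matches the intended logic is an assumption (in particular, "continuous for >= 3 times" is read loosely here as "at least 3 valid rows remain"):
def filter_name(sub):
    # Criterion 1: if one change type overwhelms the other by >= 2 rows,
    # discard the minority rows.
    n_qq = (sub['change'] == 'QQ').sum()
    n_ll = (sub['change'] == 'LL').sum()
    if n_qq - n_ll >= 2:
        sub = sub[sub['change'] != 'LL']
    elif n_ll - n_qq >= 2:
        sub = sub[sub['change'] != 'QQ']
    # Criterion 2: drop rows where nothing changed, then require at least
    # 3 valid rows to remain; otherwise the whole name is filtered out.
    sub = sub[sub['index'] != sub['change']]
    return sub if len(sub) >= 3 else sub.iloc[0:0]
It could be called on each per-name DataFrame produced by either approach in this thread (for example on seg_fram in the groupby answer below).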
Since your file is sorted and you only seem to be operating on the sub segments by name, perhaps just use Python's groupby and create a table for each name segment as you go:
from itertools import groupby
import pandas as pd

with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_data = {k: [] for k in header}
        for e in segment:
            for key, v in zip(header, e.split()):
                seg_data[key].append(v)
        seg_fram = pd.DataFrame.from_dict(seg_data)
        print(k)
        print(seg_fram)
        print()
Prints:
A
change index name
0 QQ Q A
1 QQ Q A
2 QQ Q A
B
change index name
0 LL L B
C
change index name
0 QQ Q C
1 LL L C
2 LL LL C
3 QQ Q C
4 LL L C
5 LL LL C
Then the largest piece of memory you will have will be dictated by the largest contiguous group and not the size of the file.
You can use 1/2 the memory of that method by appending to the data frame row by row instead of building the intermediate dict:
with open('/tmp/so.csv') as f:
    header = next(f).split()
    for k, segment in groupby(f, key=lambda line: line.split()[0]):
        seg_fram = pd.DataFrame(columns=header)
        for idx, e in enumerate(segment):
            df = pd.DataFrame({k: v for k, v in zip(header, e.split())}, index=[idx])
            seg_fram = pd.concat([seg_fram, df])  # DataFrame.append was removed in pandas 2.0
(might be slower though...)
If that does not work, consider using a disk database.
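In case the disk-database route is of interest, here is a rough sketch (not from the original answer; the file and table names are assumptions) using the standard-library sqlite3 module via pandas, streaming the file in chunks and then pulling back one name at a time:
import sqlite3
import pandas as pd

con = sqlite3.connect('changes.db')

# Load the big file into SQLite in manageable chunks.
for chunk in pd.read_csv('big_table.tsv', sep='\t', chunksize=100_000):
    chunk.to_sql('changes', con, if_exists='append', index=False)

# Process one name at a time, so only that group is ever in memory.
names = [row[0] for row in con.execute('SELECT DISTINCT name FROM changes')]
for name in names:
    sub = pd.read_sql('SELECT * FROM changes WHERE name = ?', con, params=(name,))
    # apply the two filtering criteria to `sub` here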
I have a number of records in a text file that represent days of a 'month' 1-30 and whether a shop is open or closed. The letters represent the shop.
A 00000000001000000000000000000
B 11000000000000000000000000000
C 00000000000000000000000000000
D 00000000000000000000000000000
E 00000000000000000000000000000
F 00000000000000000000000000000
G 00000000000000000000000000000
H 00000000000000000000000000000
I 11101111110111111011111101111
J 11111111111111111111111111111
K 00110000011000001100000110000
L 00010000001000000100000010000
M 00100000010000001000000100000
N 00000000000000000000000000000
O 11011111101111110111111011111
I want to store the 1's and 0's as-is in an array (I'm thinking numpy, but if there is another way (string, bitstring) I'd be happy with that). Then I want to be able to slice one day, i.e. a column, and get the record keys back as a set.
e.g.
A 1
B 0
C 0
D 0
E 0
F 0
G 0
H 0
I 0
J 1
K 1
L 1
M 0
N 0
O 1
day10 = {A,J,K,L,O}
I also need this to be as performant as absolutely possible.
Simplest solution I've come up with:
shops = {}
with open('input.txt', 'r') as f:
    for line in f:
        name, month = line.strip().split()
        shops[name] = [d == '1' for d in month]

dayIndex = 14
result = [s for s, v in shops.items() if v[dayIndex]]
print("Shops opened at", dayIndex, ":", result)
A numpy solution:
stores, isopen = np.genfromtxt('input.txt', dtype=str, unpack=True)
isopen = np.array([list(row) for row in isopen]) == '1'  # boolean matrix: shops x days
Then,
>>> stores[isopen[:,10]]
array(['A', 'J', 'K', 'L', 'O'],
dtype='|S30')
with open("datafile") as fin:
D = {i[0]:int(i[:1:-1], 2) for i in fin}
days = [{k for k in D if D[k] & 1<<i} for i in range(31)]
Just keep the days variable between queries
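Querying is then just an index into days; for example, for the column shown in the question (0-based day index 10):
print(days[10])   # -> {'A', 'J', 'K', 'L', 'O'}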
First, I would hesitate to write the amount of code needed to make things work with, for example, bitarray.
Second, I already upvoted BartoszKP's answer, as it looks like a reasonable approach.
Last, I would use pandas instead of numpy for such a task, as for most operations it will use the underlying numpy functions and will be reasonably fast.
If data contains your records as a string, converting them to a DataFrame can be done with
>>> df = pd.DataFrame([[x] + list(map(int, y))
...                    for x, y in [l.split() for l in data.splitlines()]])
>>> df.columns = ['Shop'] + list(map(str, range(1, 30)))
and lookups are done with
>>> df[df['3']==1]['Shop']
8 I
9 J
10 K
12 M
Name: Shop, dtype: object
Use a multilayered dictionary:
all_shops = { 'shopA': { 1: True, 2: False, 3: True ...},
.......}
Then your query is translated to
def query(shop_name, day):
    return all_shops[shop_name][day]
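Building that dictionary from the file and slicing out a whole day is straightforward; a sketch (the file name and 1-based day numbering are assumptions) might look like:
all_shops = {}
with open('input.txt') as f:
    for line in f:
        shop, day_str = line.split()
        all_shops[shop] = {day: flag == '1' for day, flag in enumerate(day_str, start=1)}

def shops_open_on(day):
    # Days are numbered 1-30 here, so the question's 0-based day10 example is day 11.
    return {shop for shop, open_days in all_shops.items() if open_days.get(day, False)}

print(shops_open_on(11))   # -> {'A', 'J', 'K', 'L', 'O'}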
with open("datafile") as f:
for line in f:
shop, _days = line.split()
for i,d in enumerate(_days):
if d == '1':
days[i].add(shop)
Simpler, faster and answers the question