Double for loop over Pandas df - python

I have a dataframe containing a column of subreddits and another column containing authors who have commented in that subreddit. Here is a snapshot:
subreddit user
0xProject [7878ayush, Mr_Yukon_C, NomChompsky92, PM_ME_Y...
100sexiest [T10rock]
100yearsago [PM_ME_MII, Quisnam]
1022 [MikuWaifuForLaifu, ghrshow, johnnymn1]
1200isjerky [Rhiann0n, Throwaway412160987]
1200isplenty [18hourbruh, Bambi726, Cosmiicao, Gronky_Kongg...
1200isplentyketo [yanqi83]
12ozmouse [ChBass]
12thMan [8064r7, TxAg09, brb1515]
12winArenaLog [fnayr]
13ReasonsWhy [SawRub, _mw8, morbs4]
13or30 [BOTS_RISE_UP, mmcjjc]
14ers [BuccoFan8]
1500isplenty [nnowak]
15SecondStories [DANKY-CHAN, NORMIESDIE]
18650masterrace [Airazz]
18_19 [-888-, 3mb3r89, FuriousBiCurious, FusRohDoing...
1911 [EuphoricaI, Frankshungry, SpicyMagnum23, cnw4...
195 [RobDawg344, ooi_]
19KidsandCounting [Kmw134, Lvzv, mpr1011, runjanarun]
1P_LSD [420jazz, A1M8E7, A_FABULOUS_PLUM, BS_work, EL...
2007oneclan [J_D_I]
2007scape [-GrayMan-, -J-a-y-, -Maxy-, 07_Tank, 0ipopo, ...
2010sMusic [Vranak]
21savage [Uyghur1]
22lr [microphohn]
23andme [Nimushiru, Pinuzzo, Pugmas, Sav1025, TOK715, ...
240sx [I_am_a_Dan, SmackSmackk, jimmyjimmyjimmy_, pr...
24CarrotCraft [pikaras]
24hoursupport [GTALionKing, Hashi856, Moroax, SpankN, fuck_u...
...
youtubetv [ComLaw, P1_1310, kcamacho11]
yoyhammer [Emicrania, Jbugman, RoninXiC, Sprionk, jonow83]
ypp [Loxcam]
ypsi [FLoaf]
ytp [Profsano]
yugijerk [4sham, Exos_VII]
yugioh [1001puppys, 6000j, 8512332158, A_fiSHy_fish, ...
yumenikki [ripa9]
yuri [COMMENTS_ON_NSFW_PIC, MikuxLuka401, Pikushibu...
yuri_jp [Pikushibu]
yuruyuri [ACG_Yuri, KirinoNakano, OSPFv3, SarahLia]
zagreb [jocus985]
zcoin [Fugazi007]
zec [Corm, GSXP, JASH_DOADELESS_, PSYKO_Inc, infinis]
zedmains [BTZx2, EggyGG, Ryan_A121, ShacObama, Tryxi, m...
zelda [01110111011000010111, Aura64, AzaraAybara, BA...
zen [ASAMANNAMMEDNIGEL, Cranky_Kong, Dhammakayaram...
zerocarb [BigBrain007, Manga-san, vicinius]
zetime [xrnzrx]
zfs [Emachina, bqq100, fryfrog, michio_kakus_hair,...
ziftrCOIN [GT712]
zoemains [DrahaKka, OJSaucy, hahAAsuo, nysra, x3noPLEB,...
zombies [carbon107, rjksn]
zomby [jwccs46]
zootopia [BCRE8TVE, Bocaj1000, BunnyMakingAMark, Far414...
zumba [GabyArcoiris]
zyramains [Dragonasaur, Shaiaan]
zyzz [Xayv]
I am trying to iterate over every subreddit and then iterate over every subreddit beneath that to find shared commenters. The end goal is a dataframe containing subreddit 1, subreddit 2, and the number of shared commenters.
I can't even conceive of how to do this using apply, and I'm not sure how to write a double for loop with pandas DataFrames.
Is this the right idea?
for i in df2.index:
    subreddit = df2.get_value(i, 'subreddit')
    for i+1 in df2.index:
        ...
Here's an example of input and intended output:
df = pd.DataFrame({'subreddit': ['sub1', 'sub2', 'sub3', 'sub4'],
                   'user': [['A', 'B', 'C'], ['A', 'F', 'C'], ['F', 'E', 'D'], ['X', 'Y', 'Z']]})
Output for first subreddit:
subreddit_1 subreddit_2 shared_users
sub1 sub2 2
sub1 sub3 0
sub1 sub4 0

I don't know if you can get around using loops. This is pretty similar to how you would calculate a correlation matrix, which uses loops in the pandas documentation. At least it's symmetric, so you only have to compare half of the pairs.
Instead of calculating a correlation, you want the number of elements shared between two lists lst1 and lst2, which is len(set(lst1) & set(lst2)).
import pandas as pd
import numpy as np

df = pd.DataFrame({'subreddit': ['sub1', 'sub2', 'sub3', 'sub4'],
                   'user': [['A', 'B', 'C'], ['A', 'F', 'C'], ['F', 'E', 'D'], ['X', 'Y', 'Z']]})
mat = df.user
cols = df.subreddit
idx = cols.copy()
K = len(cols)
correl = np.empty((K, K), dtype=int)
for i, ac in enumerate(mat):
    for j, bc in enumerate(mat):
        if i > j:
            continue
        c = len(set(ac) & set(bc))
        correl[i, j] = c
        correl[j, i] = c
overlap_df = pd.DataFrame(correl, index=idx, columns=cols)

#subreddit  sub1  sub2  sub3  sub4
#subreddit
#sub1          3     2     0     0
#sub2          2     3     1     0
#sub3          0     1     3     0
#sub4          0     0     0     3
And if you want to get those smaller DataFrames then you just need a little bit of manipulation. For example:
overlap_df.index.name = 'subreddit_1'
overlap_df[['sub1']].stack().reset_index().rename(columns={0: 'shared_users'})

  subreddit_1 subreddit  shared_users
0        sub1      sub1             3
1        sub2      sub1             2
2        sub3      sub1             0
3        sub4      sub1             0
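If you want the question's long three-column frame directly, iterating over unordered pairs also works. This is a minimal sketch (using itertools.combinations, not part of the answer above) that builds each user set once and intersects every pair:

```python
import pandas as pd
from itertools import combinations

df = pd.DataFrame({'subreddit': ['sub1', 'sub2', 'sub3', 'sub4'],
                   'user': [['A', 'B', 'C'], ['A', 'F', 'C'],
                            ['F', 'E', 'D'], ['X', 'Y', 'Z']]})

# Build each user set once, then intersect every unordered pair of subreddits.
user_sets = dict(zip(df['subreddit'], df['user'].map(set)))
pairs = [(a, b, len(user_sets[a] & user_sets[b]))
         for a, b in combinations(user_sets, 2)]
out = pd.DataFrame(pairs, columns=['subreddit_1', 'subreddit_2', 'shared_users'])
# First rows: sub1/sub2 -> 2, sub1/sub3 -> 0, sub1/sub4 -> 0
```

Since the overlap is symmetric, the pairs list only holds each combination once, which for K subreddits is K*(K-1)/2 rows instead of a full K-by-K matrix.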


Replace value based on a corresponding value but keep value if criteria not met

Given the following dataframe,
INPUT df:

Cost_centre  Pool_costs
90272        A
92705        A
98754        A
91350        A

Replace the Pool_costs value with 'B' for the given Cost_centre values, but keep the Pool_costs value if the Cost_centre value does not appear in the list.

OUTPUT df:

Cost_centre  Pool_costs
90272        B
92705        A
98754        A
91350        B
Current Code:
This code works up until the else side of the lambda; finding the existing Pool_costs value again is the hard part.
df = pd.DataFrame({'Cost_centre': [90272, 92705, 98754, 91350],
                   'Pool_costs': ['A', 'A', 'A', 'A']})
pool_cc = [90272, 91350]
pool_cc_set = set(pool_cc)
df['Pool_costs'] = df['Cost_centre'].apply(lambda x: 'B' if x in pool_cc_set else df['Pool_costs'])
print(df)
I have used the following and have found success, but it gets hard to read and modify when there are a lot of Cost_centre values to change.
df = pd.DataFrame({'Cost_centre': [90272, 92705, 98754, 91350],
                   'Pool_costs': ['A', 'A', 'A', 'A']})
filt = (df['Cost_centre'] == 90272) | (df['Cost_centre'] == 91350)
df.loc[filt, 'Pool_costs'] = 'B'
IIUC, you can use isin
filt = df['Cost_centre'].isin([90272, 91350])
df.loc[filt, 'Pool_costs'] = 'B'
print(df)
   Cost_centre Pool_costs
0        90272          B
1        92705          A
2        98754          A
3        91350          B
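When many cost centres map to different new pool codes, a plain dict plus map/fillna keeps the filter readable. A sketch (the overrides dict here is hypothetical, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'Cost_centre': [90272, 92705, 98754, 91350],
                   'Pool_costs': ['A', 'A', 'A', 'A']})

# Hypothetical mapping: each listed cost centre gets its new pool code;
# unlisted centres keep their current Pool_costs value via fillna.
overrides = {90272: 'B', 91350: 'B'}
df['Pool_costs'] = df['Cost_centre'].map(overrides).fillna(df['Pool_costs'])
print(df)
```

Adding another cost centre is then a one-line change to the dict rather than another `|` clause in the boolean filter.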

handling a geohash dict look up with spatial joins

I have a dictionary with geohash as keys and a value associated with them. I am looking up values from the dict to create a new column in my pandas dataframe.
geo_dict = {'9q5dx': 10, '9q9hv': 15, '9q5dv': 20}
df = pd.DataFrame({'geohash': ['9q5dx', '9qh0g', '9q9hv', '9q5dv'],
                   'label': ['a', 'b', 'c', 'd']})
df['value'] = df.apply(lambda x: geo_dict[x.geohash], axis=1)
I need to be able to handle non-matches, i.e. geohashes that do not exist in the dictionary. Expected handling below:
- Find k nearby geohashes and compute the mean of their values
- Assign that mean to the pandas column
Questions:
- Is there a library I can use to find nearby geohashes?
- How do I code up this solution?
The module pygeodesy has several functions to calculate the distance between geohashes. We can wrap this in a function that first checks if a match exists in the dict, and otherwise returns the mean value of the n closest geohashes:
import pygeodesy as pgd
import pandas as pd

geo_dict = {'9q5dx': 10, '9q9hv': 15, '9q5dv': 20}
geo_df = pd.DataFrame(zip(geo_dict.keys(), geo_dict.values()), columns=['geohash', 'value'])
df = pd.DataFrame({'geohash': ['9q5dx', '9qh0g', '9q9hv', '9q5dv'],
                   'label': ['a', 'b', 'c', 'd']})

def approximate_distance(geohash1, geohash2):
    return pgd.geohash.distance_(geohash1, geohash2)
    #return pgd.geohash.equirectangular_(geohash1, geohash2)  # alternative ways to calculate the distance
    #return pgd.geohash.haversine_(geohash1, geohash2)

def get_value(x, n=2):  # set the number of closest geohashes used for the approximation with n
    val = geo_df.loc[geo_df['geohash'] == x]
    if not val.empty:
        return val['value'].iloc[0]
    else:
        geo_df['tmp_dist'] = geo_df['geohash'].apply(lambda y: approximate_distance(y, x))
        # nsmallest, not nlargest: the closest geohashes are the ones with the smallest distance
        return geo_df.nsmallest(n, 'tmp_dist')['value'].mean()

df['value'] = df['geohash'].apply(get_value)
result:

  geohash label value
0   9q5dx     a    10
1   9qh0g     b  12.5
2   9q9hv     c    15
3   9q5dv     d    20
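If pygeodesy isn't available, a rough fallback with the same shape is to rank the known geohashes by the length of the shared prefix (a longer common prefix roughly means closer). This sketch is not part of the answer above, and the prefix heuristic is much cruder than a real distance function:

```python
import pandas as pd

geo_dict = {'9q5dx': 10, '9q9hv': 15, '9q5dv': 20}

def shared_prefix_len(g1, g2):
    # count leading characters the two geohashes have in common
    n = 0
    for a, b in zip(g1, g2):
        if a != b:
            break
        n += 1
    return n

def get_value(gh, n=2):
    if gh in geo_dict:
        return geo_dict[gh]
    # rank known geohashes by longest shared prefix and average the top n
    ranked = sorted(geo_dict, key=lambda g: shared_prefix_len(g, gh), reverse=True)
    return sum(geo_dict[g] for g in ranked[:n]) / n

df = pd.DataFrame({'geohash': ['9q5dx', '9qh0g', '9q9hv', '9q5dv']})
df['value'] = df['geohash'].map(get_value)
```

Note that ties in prefix length are broken by dict insertion order here, so for serious use a proper geodesic distance (as in the pygeodesy answer) is preferable.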

How to convert the if else logic into dynamic selection?

I am writing logic to compare a few values.
I have three lists of values and one rule list
new_values = [1,0,0,0,1,1]
old_1 = [1,1,1,0,0,1]
old_2 = [1,0,1,0,1,1]
# when a is correct #when b is correct # if both are correct # if both are wrong
rules = ['a', 'b', 'combine', 'leave']
What I am looking for is, compare new_values to old_1 and old_2 values based on that select rule from rules list.
something like this:
def logic(new_values, old_values, rules):
rules_result = []
for new_value, old_value_1, old_value_2 in zip(new_values, old_values[0], old_values[1]):
if new_value == old_value_1 and new_value == old_value_2:
# if both are correct
rules_result.append(rules[2])
elif new_value == old_value_1:
# if a is correct
rules_result.append(rules[0])
elif new_value == old_value_2:
# if b is correct
rules_result.append(rules[1])
elif new_value!= old_value_1 and new_value!= old_value_2:
# if both are wrong
rules_result.append(rules[3])
return rules_result
Running this code with one rule list gives me this result :
logic(new_values, [old_1, old_2], rules)
output
['combine', 'b', 'leave', 'combine', 'b', 'combine']
I am facing issue to make this code dynamic if I have to compare more than two old values list, let say If I have three lists of old values and then my rule list will expand for each combination
new_values = [1,0,0,0,1,1]
old_1 = [1,1,1,0,0,1]
old_2 = [1,0,1,0,1,1]
old_3 = [0,0,0,1,1,1]
# when a is correct #when b is correct # if a and b are correct # if a and c are correct #if b and c are correct' #if all three are correct # if all three are wrong
rules = ['a', 'b', 'combine a_b', 'select c', 'combine b_c', 'select a', 'combine']
I am getting rules and values from a different function, I am looking for a rule selection function, where pass the list of old values ( example 2,3,4 list ) with new value and rule list, then dynamically compare each old list with new value list and select the rule from rule list.
How to make logic function dynamic to work on more than two old list values?
This problem can be solved easily using the concept of a truth table. Your rules list defines the outcome for some combinations of boolean values. The outcomes aren't 1s and 0s, so they can't be expressed with truth functions like and, or, xor, but that's not a problem. You can simply rearrange your list to follow the order of the truth table:
# for 2 boolean variables, there are 2 ^ 2 = 4 possibilities
# ab:    00       01   10     11
rules = ["leave", "b", "a", "combine"]
You can also turn this into a dict so you don't need to comment them to remember which one is what (and as a bonus, it will look like a truth table :)):
#        ab
rules = {"00": "leave",
         "01": "b",
         "10": "a",
         "11": "combine"}
Now, define a function to get the related key value for your boolean variables:
def get_rule_key(reference, values):
    """Compare all the values against reference and return a string for the result."""
    return "".join(str(int(value == reference)) for value in values)
And your logic function will be simply this:
def logic(new_values, old_values, rules):
    rules_result = []
    for new_value, *old_values in zip(new_values, *old_values):
        key = get_rule_key(new_value, old_values)
        rules_result.append(rules.get(key))
    return rules_result

print(logic(new_values, [old_1, old_2], rules))
# ['combine', 'b', 'leave', 'combine', 'b', 'combine']
For triples update your rules accordingly:
# for 3 boolean variables, there are 2 ^ 3 = 8 possibilities
#         abc
rules = {"000": "combine",
         # "001": not defined in your rules,
         "010": "b",
         "011": "combine b_c",
         "100": "a",
         "101": "select c",
         "110": "combine a_b",
         "111": "select a"}

print(logic(new_values, [old_1, old_2, old_3], rules))
# ['combine a_b', 'combine b_c', None, 'combine a_b', 'combine b_c', 'select a']
Notes:
None appears in the output because your rules dict doesn't define an output for "001", and dict.get returns None by default.
If you want to use a list to define the rules, you have to define all the rules in order and convert the result of get_rule_key to an integer: "011" -> 3. You can manage this with int(x, base=2).
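A minimal sketch of that list-based variant (the ordering of rules_list is an assumption matching the two-variable truth table above):

```python
# Rules as a list, indexed by the integer value of the truth-table key:
# index 0 -> "00", 1 -> "01", 2 -> "10", 3 -> "11"
rules_list = ["leave", "b", "a", "combine"]

def get_rule_key(reference, values):
    return "".join(str(int(value == reference)) for value in values)

key = get_rule_key(1, [1, 0])        # old_1 matches, old_2 does not -> "10"
rule = rules_list[int(key, base=2)]  # int("10", 2) == 2 -> "a"
```

Unlike the dict version, a missing rule raises IndexError instead of returning None, so the list must cover all 2 ** n keys.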
With unknown inputs it will be difficult to produce the labels you specify.
It is easy, though, to map which of the old values corresponds to the same new value (positionally speaking). You can use a generic test function that gets a "new" value and all "old" values at that position, maps the old values to 'a'..., and returns which ones correlate:
new_values = [1, 0, 0, 0, 1, 1]
old_1 = [1, 1, 1, 0, 0, 1]
old_2 = [1, 0, 1, 0, 1, 1]
old_3 = [0, 0, 0, 1, 1, 1]
old_4 = [0, 0, 0, 1, 1, 1]
old_5 = [0, 0, 0, 1, 1, 1]

def test(args):
    nv, remain = args[0], list(args[1:])
    start = ord("a")
    # create a dict from letter to its corresponding value
    rv = {chr(start + i): v for i, v in enumerate(remain)}
    # return tuples of the input and the matching outputs
    return ((nv, remain), [k for k, v in rv.items() if v == nv])

rv = []
for values in zip(new_values, old_1, old_2, old_3, old_4, old_5):
    rv.append(test(values))

print(*rv, sep="\n")
print([b for a, b in rv])
Output (manually spaced out):
# nv old_1 old_2 old_3 old_4 old_5
# a b c d e
((1, [ 1, 1, 0, 0, 0]), ['a', 'b'])
((0, [ 1, 0, 0, 0, 0]), ['b', 'c', 'd', 'e'])
((0, [ 1, 1, 0, 0, 0]), ['c', 'd', 'e'])
((0, [ 0, 0, 1, 1, 1]), ['a', 'b'])
((1, [ 0, 1, 1, 1, 1]), ['b', 'c', 'd', 'e'])
((1, [ 1, 1, 1, 1, 1]), ['a', 'b', 'c', 'd', 'e'])
[['a', 'b'], ['b', 'c', 'd', 'e'], ['c', 'd', 'e'], ['a', 'b'],
['b', 'c', 'd', 'e'], ['a', 'b', 'c', 'd', 'e']]
You could then map the joined results to some output:
# the mapping needs completion - you'll need to hardcode that,
# but if you consider inputs of up to 5 old values, you only need to specify
# the combinations that can happen for 5, as 4, 3, 2 are automagically included
# if you omit "all of them" as a result
mapper = {"a": "only a", "ab": "a and b", "ac": "a and c", "abcde": "all of them",
          # complete to your satisfaction on your own
          }

for inp, outp in rv:
    result = ''.join(outp)
    print(mapper.get(result, f"->'{result}' not mapped!"))
to get an output of:
a and b
->'bcde' not mapped!
->'cde' not mapped!
a and b
->'bcde' not mapped!
all of them

User input to create a column in Pandas DataFrame

I have a pandas DataFrame:
sample_data = {'Sample': ['A', 'B', 'A', 'B'],
               'Surface': ['Top', 'Bottom', 'Top', 'Bottom'],
               'Intensity': [21, 32, 14, 45]}
sample_dataframe = pd.DataFrame(data=sample_data)
And I have a function that gets user input to create a 'Condition' column for each 'Sample':
def get_choice(df, column):
    #df['Condition'] = user_input
    user_input = []
    for i in df[column]:
        print('\n', i)
        user_input.append(input('Condition= '))
    df['Condition'] = user_input
    return df

get_choice(group_fname, 'Sample')
This works; however, the user is prompted once for every row in which a 'Sample' appears. That is not a problem in this example, where each Sample has two rows, but when the DataFrame is larger and multiple samples each occupy multiple rows, it gets tedious.
How do I create a function that fills the 'Condition' column for every row a 'Sample' occupies while only asking for the input once?
I tried creating a function that returns a dictionary and then .apply()-ing that to the DataFrame, but it still asks for input for each instance of the 'Sample'.
If I understand your question right, you want to get user input only once for each unique value and then create column 'Condition':
sample_data = {'Sample': ['A', 'B', 'A', 'B'],
               'Surface': ['Top', 'Bottom', 'Top', 'Bottom'],
               'Intensity': [21, 32, 14, 45]}
sample_dataframe = pd.DataFrame(data=sample_data)

def get_choice(df, column):
    m = {}
    for v in df[column].unique():
        m[v] = input('Condition for [{}] = '.format(v))
    df['Condition'] = df[column].map(m)
    return df

print(get_choice(sample_dataframe, 'Sample'))
Prints (for example):

Condition for [A] = 1
Condition for [B] = 2

  Sample Surface  Intensity Condition
0      A     Top         21         1
1      B  Bottom         32         2
2      A     Top         14         1
3      B  Bottom         45         2

Google cloud NL API data to Pandas Dataframe

I'm using the Google NL API (sample_classify_text).
It's sending me data that I transformed into this format:
response_list = [[['a', 'b', 'c'], [1,2,3], ['url1']], [['d'], [4], ['url2']]]
From here I'd like to build a Pandas df that looks like this:
a b c 1 2 3 url1
d 4 url2
The number of results for each url is different (a, b, c = 3 results; d = 1 result). It seems that the number of results is usually < 4, but I'm not sure about this, so I'd like to keep it flexible.
I've tried a few things, but it gets pretty complicated. I'm wondering if there's an easy way to handle that?
Have you tried creating a pandas DataFrame directly from the list?
Like this:
import pandas as pd
response_list = [[['a', 'b', 'c'], [1,2,3], ['url1']], [['d'], [4], ['url2']]]
df = pd.DataFrame(response_list)
The result of the print(df) is:
           0          1       2
0  [a, b, c]  [1, 2, 3]  [url1]
1        [d]        [4]  [url2]
That's what I ended up doing.
Not the most elegant solution...
Please don't tell me this can be done with a one-liner :D
import pandas as pd
response_list = [[['a', 'b', 'c'], [1,2,3], ['url1']], [['d'], [4], ['url2']]]
colum_0, colum_1, colum_2, colum_3, colum_4, colum_5, colum_6 = [None], [None], [None], [None], [None], [None], [None]  # to create the columns

for main_list in response_list:
    for idx_macro, sub_list in enumerate(main_list):
        for idx, elem in enumerate(sub_list):
            if idx_macro == 0:
                if idx == 0:
                    colum_0.append(elem)
                if idx == 1:
                    colum_1.append(elem)
                if idx == 2:
                    colum_2.append(elem)
            elif idx_macro == 1:
                if idx == 0:
                    colum_3.append(elem)
                if idx == 1:
                    colum_4.append(elem)
                if idx == 2:
                    colum_5.append(elem)
            elif idx_macro == 2:
                colum_6.append(elem)

colum_lists = [colum_0, colum_1, colum_2, colum_3, colum_4, colum_5, colum_6]
longest_list = 3
colum_lists2 = []
for lst in colum_lists[:-1]:  # skip urls
    while len(lst) < longest_list:
        lst.append(None)
    colum_lists2.append(lst)
colum_lists2.append(colum_6)  # add urls

df = pd.DataFrame(colum_lists2)
df = df.transpose()
df = df.drop(0)
display(df)
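Not quite a one-liner, but the same flattening can be done much more compactly by expanding each column of lists into its own block of scalar columns and concatenating the blocks. A sketch (not the original poster's code; apply(pd.Series) pads the shorter lists with NaN automatically):

```python
import pandas as pd

response_list = [[['a', 'b', 'c'], [1, 2, 3], ['url1']], [['d'], [4], ['url2']]]

raw = pd.DataFrame(response_list)
# Expand each column of lists into scalar columns, then glue the blocks
# side by side; rows with shorter lists are padded with NaN.
flat = pd.concat([raw[col].apply(pd.Series) for col in raw.columns], axis=1)
flat.columns = range(flat.shape[1])  # re-number the duplicated column labels
print(flat)
```

This yields one row per response with the labels, scores, and url spread across columns, matching the desired layout without the hand-rolled triple loop.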
