Google Cloud NL API data to Pandas DataFrame - python

I'm using the Google NL API (sample_classify_text).
It's sending me data that I transformed into this format:
response_list = [[['a', 'b', 'c'], [1,2,3], ['url1']], [['d'], [4], ['url2']]]
From here I'd like to build a Pandas df that looks like this:
a b c 1 2 3 url1
d 4 url2
Note that the number of results for each url is different (a, b, c = 3 results; d = 1 result). It seems that most of the time the number of results is < 4, but I'm not sure about this, so I'd like to keep it flexible.
I've tried a few things, but it gets pretty complicated. I'm wondering if there's an easy way to handle that?

Have you tried creating a Pandas DF directly from the list?
Such like:
import pandas as pd
response_list = [[['a', 'b', 'c'], [1,2,3], ['url1']], [['d'], [4], ['url2']]]
df = pd.DataFrame(response_list)
The result of print(df) is:
0 1 2
0 [a, b, c] [1, 2, 3] [url1]
1 [d] [4] [url2]
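From there, one possible follow-up (a sketch I'm adding, not part of the original answer) is to expand each list column into its own set of flat columns and let pandas pad the shorter rows:
expanded = pd.concat(
    [pd.DataFrame(df[c].tolist(), index=df.index).add_prefix(f"col{c}_") for c in df.columns],
    axis=1,
)
# rows with fewer elements (like the 'd' / 4 row) are padded with None/NaN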

That's what I ended up doing.
Not the most elegant solution...
Please don't tell me this can be done with a one-liner :D
import pandas as pd
response_list = [[['a', 'b', 'c'], [1,2,3], ['url1']], [['d'], [4], ['url2']]]
colum_0, colum_1, colum_2, colum_3, colum_4, colum_5, colum_6 = [None], [None], [None], [None], [None], [None], [None]  # to create the columns
for main_list in response_list:
    for idx_macro, sub_list in enumerate(main_list):
        for idx, elem in enumerate(sub_list):
            if idx_macro == 0:
                if idx == 0:
                    colum_0.append(elem)
                if idx == 1:
                    colum_1.append(elem)
                if idx == 2:
                    colum_2.append(elem)
            elif idx_macro == 1:
                if idx == 0:
                    colum_3.append(elem)
                if idx == 1:
                    colum_4.append(elem)
                if idx == 2:
                    colum_5.append(elem)
            elif idx_macro == 2:
                colum_6.append(elem)
colum_lists = [colum_0, colum_1, colum_2, colum_3, colum_4, colum_5, colum_6]
longest_list = 3
colum_lists2 = []
for lst in colum_lists[:-1]:  # skip urls
    while len(lst) < longest_list:
        lst.append(None)
    colum_lists2.append(lst)
colum_lists2.append(colum_6)  # add urls
df = pd.DataFrame(colum_lists2)
df = df.transpose()
df = df.drop(0)
display(df)
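For what it's worth, a shorter route (a sketch under the assumption that each response has the same number of labels and scores) is to pad each sublist to a common width and build the rows directly:
import pandas as pd

response_list = [[['a', 'b', 'c'], [1, 2, 3], ['url1']], [['d'], [4], ['url2']]]

# width adapts to the largest number of results per url, so it stays flexible
width = max(len(labels) for labels, scores, urls in response_list)

rows = []
for labels, scores, urls in response_list:
    pad = [None] * (width - len(labels))
    rows.append(labels + pad + scores + pad + urls)  # labels, scores, then the url

df = pd.DataFrame(rows)
# row 0: a  b     c     1  2     3     url1
# row 1: d  None  None  4  None  None  url2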

Related

Adding records into a dataframe using function

I want to add new records to alldata by concatenating newdata to it. Somehow alldata gets wiped out after each cycle. Can you please help me fix this?
import schedule
import time
import pandas as pd

alldata = pd.DataFrame()

def myfunc(outside_file):
    newdata = pd.DataFrame({'id': ['A', 'c'],
                            'value': [20, 0]})
    print(newdata)
    outside_file = pd.concat([outside_file, newdata])
    print(outside_file)

schedule.every().minute.at(':00').do(myfunc, alldata)
while True:
    schedule.run_pending()
    print(alldata)
    time.sleep(30)
You're not modifying alldata: you're passing it as an argument to a function, and what's being modified is the argument inside this function, not the global variable. Compare the following:
Snippet 1:
import pandas as pd

alldata = pd.DataFrame()

def run(outside_df):
    df = pd.DataFrame({'id': ['A', 'B'], 'value': [20, 0]})
    outside_df = pd.concat([outside_df, df])

for _ in range(5):
    run(alldata)

print(alldata)
# Empty DataFrame
# Columns: []
# Index: []
Snippet 2:
import pandas as pd

alldata = pd.DataFrame()

def run():
    global alldata
    df = pd.DataFrame({'id': ['A', 'B'], 'value': [20, 0]})
    alldata = pd.concat([alldata, df])

for _ in range(5):
    run()

print(alldata)
#   id  value
# 0  A     20
# 1  B      0
# 0  A     20
# 1  B      0
# 0  A     20
# 1  B      0
# 0  A     20
# 1  B      0
# 0  A     20
# 1  B      0
Moral of the story: use global when you want to rebind a global variable inside a function (or, better, don't use a global at all and have the function return the new DataFrame).
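For the "don't" route, a minimal sketch (mine, not from the snippets above) is to return the new DataFrame and reassign it in the caller:
import pandas as pd

alldata = pd.DataFrame()

def run(outside_df):
    df = pd.DataFrame({'id': ['A', 'B'], 'value': [20, 0]})
    return pd.concat([outside_df, df])  # return the result instead of rebinding the parameter

for _ in range(5):
    alldata = run(alldata)  # caller reassigns the global name

print(alldata)
Note that this pattern doesn't plug directly into the schedule.do(...) call from the question, since schedule ignores a job's return value; for the scheduled version, global (or a mutable container) is the simpler fix.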

How to transpose values from top few rows in python dataframe into new columns

I am trying to select the values from the top 3 records of each group in a sorted Python dataframe and put them into new columns. I have a function that processes each group, but I am having difficulty finding the right method to extract and rename the series, then combine the results into a single series to return.
Below is a simplified example of an input dataframe (df_in) and the expected output (df_out):
import pandas as pd

data_in = {'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
           'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
           'Qty': [15, 13, 14, 10, 5, 2, 1]}
df_in = pd.DataFrame(data_in, columns=['Product', 'Price', 'Qty'])
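(The expected df_out is not shown above; from the description and best3_prices_v1 below, it would presumably hold the best 3 prices and quantities per product as columns, zero-filled where fewer than 3 records exist:)
  Product  Price_1  Qty_1  Price_2  Qty_2  Price_3  Qty_3
0       A     25.0     15     30.5     13     50.0     14
1       B    120.0      5      0.0      0      0.0      0
2       C    650.0      2    680.0      1      0.0      0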
I am reproducing below 2 examples of the functions I've tested, and am trying to find a more efficient option that works, especially if I have to process many more columns and records.
Function best3_prices_v1 works, but I have to explicitly specify each column and variable, which becomes more of an issue as I add columns.
def best3_prices_v1(x):
    d = {}
    # get best 3 records if records available, else set volumes as zeroes
    best_price_lv1 = x.iloc[0].copy()
    rec_with_zeroes = best_price_lv1.copy()
    rec_with_zeroes['Price'] = 0
    rec_with_zeroes['Qty'] = 0
    recs = len(x)  # number of records
    if recs == 1:
        # 2nd and 3rd records not available
        best_price_lv2 = rec_with_zeroes.copy()
        best_price_lv3 = rec_with_zeroes.copy()
    elif recs == 2:
        best_price_lv2 = x.iloc[1]
        # 3rd record not available
        best_price_lv3 = rec_with_zeroes.copy()
    else:
        best_price_lv2 = x.iloc[1]
        best_price_lv3 = x.iloc[2]
    # 1st best
    d['Price_1'] = best_price_lv1['Price']
    d['Qty_1'] = best_price_lv1['Qty']
    # 2nd best
    d['Price_2'] = best_price_lv2['Price']
    d['Qty_2'] = best_price_lv2['Qty']
    # 3rd best
    d['Price_3'] = best_price_lv3['Price']
    d['Qty_3'] = best_price_lv3['Qty']
    # return combined results as a series
    return pd.Series(d, index=['Price_1', 'Qty_1', 'Price_2', 'Qty_2', 'Price_3', 'Qty_3'])
Code to call the function:
# sort dataframe by Product and Price
df_in.sort_values(by=['Product', 'Price'], ascending=True, inplace=True)
# get best 3 prices and qty as new columns
df_out = df_in.groupby(['Product']).apply(best3_prices_v1).reset_index()
Second attempt to improve/reduce the code and the explicit names for each variable... not complete and not working.
def best3_prices_v2(x):
    d = {}
    # get best 3 records if records available, else set volumes as zeroes
    best_price_lv1 = x.iloc[0].copy()
    rec_with_zeroes = best_price_lv1.copy()
    rec_with_zeroes['Price'] = 0
    rec_with_zeroes['Qty'] = 0
    recs = len(x)  # number of records
    if recs == 1:
        # 2nd and 3rd records not available
        best_price_lv2 = rec_with_zeroes.copy()
        best_price_lv3 = rec_with_zeroes.copy()
    elif recs == 2:
        best_price_lv2 = x.iloc[1]
        # 3rd record not available
        best_price_lv3 = rec_with_zeroes.copy()
    else:
        best_price_lv2 = x.iloc[1]
        best_price_lv3 = x.iloc[2]
    stats_columns = ['Price', 'Qty']
    # get records values for best 3 prices
    d_lv1 = best_price_lv1[stats_columns]
    d_lv2 = best_price_lv2[stats_columns]
    d_lv3 = best_price_lv3[stats_columns]
    # How to rename (keys?) or combine values to return?
    lv1_stats_columns = [c + '_1' for c in stats_columns]
    lv2_stats_columns = [c + '_2' for c in stats_columns]
    lv3_stats_columns = [c + '_3' for c in stats_columns]
    # return combined results as a series
    return pd.Series(d, index=lv1_stats_columns + lv2_stats_columns + lv3_stats_columns)
Let's unstack():
df_in = (df_in.set_index([df_in.groupby('Product').cumcount().add(1), 'Product'])
              .unstack(0, fill_value=0))
df_in.columns = [f"{x}_{y}" for x, y in df_in]
df_in = df_in.reset_index()
OR via pivot()
df_in = (df_in.assign(key=df_in.groupby('Product').cumcount().add(1))
              .pivot('Product', 'key', ['Price', 'Qty'])
              .fillna(0, downcast='infer'))
df_in.columns = [f"{x}_{y}" for x, y in df_in]
df_in = df_in.reset_index()
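As a usage sketch (mine, mirroring the pivot call above and starting again from the original df_in), restricting to the best 3 rows per product first keeps exactly three sets of columns; note that newer pandas versions may require keyword arguments for pivot (index=, columns=, values=):
top3 = df_in.sort_values(['Product', 'Price']).groupby('Product').head(3)
out = (top3.assign(key=top3.groupby('Product').cumcount().add(1))
           .pivot('Product', 'key', ['Price', 'Qty'])
           .fillna(0, downcast='infer'))
out.columns = [f"{x}_{y}" for x, y in out.columns]  # Price_1, Price_2, Price_3, Qty_1, ...
out = out.reset_index()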
Based on #AnuragDabas's pivot solution and #ceruler's feedback above, I can now expand it to a more general problem.
New dataframe with more groups and columns:
data_in = {'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
           'Model': ['A1', 'A1', 'A1', 'A2', 'B1', 'C1', 'C1'],
           'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
           'Qty': [15, 13, 14, 10, 5, 2, 1],
           'Ratings': [9, 7, 8, 10, 6, 7, 8]}
df_in = pd.DataFrame(data_in, columns=['Product', 'Model', 'Price', 'Qty', 'Ratings'])

group_list = ['Product', 'Model']
stats_list = ['Price', 'Qty', 'Ratings']

df_out = df_in.groupby(group_list).head(3)
df_out = (df_out.assign(key=df_out.groupby(group_list).cumcount().add(1))
                .pivot(group_list, 'key', stats_list)
                .fillna(0, downcast='infer'))
df_out.columns = [f"{x}_{y}" for x, y in df_out]
df_out = df_out.reset_index()

Double for loop over Pandas df

I have a dataframe containing a column of subreddits and another column containing authors who have commented in that subreddit. Here is a snapshot:
subreddit user
0xProject [7878ayush, Mr_Yukon_C, NomChompsky92, PM_ME_Y...
100sexiest [T10rock]
100yearsago [PM_ME_MII, Quisnam]
1022 [MikuWaifuForLaifu, ghrshow, johnnymn1]
1200isjerky [Rhiann0n, Throwaway412160987]
1200isplenty [18hourbruh, Bambi726, Cosmiicao, Gronky_Kongg...
1200isplentyketo [yanqi83]
12ozmouse [ChBass]
12thMan [8064r7, TxAg09, brb1515]
12winArenaLog [fnayr]
13ReasonsWhy [SawRub, _mw8, morbs4]
13or30 [BOTS_RISE_UP, mmcjjc]
14ers [BuccoFan8]
1500isplenty [nnowak]
15SecondStories [DANKY-CHAN, NORMIESDIE]
18650masterrace [Airazz]
18_19 [-888-, 3mb3r89, FuriousBiCurious, FusRohDoing...
1911 [EuphoricaI, Frankshungry, SpicyMagnum23, cnw4...
195 [RobDawg344, ooi_]
19KidsandCounting [Kmw134, Lvzv, mpr1011, runjanarun]
1P_LSD [420jazz, A1M8E7, A_FABULOUS_PLUM, BS_work, EL...
2007oneclan [J_D_I]
2007scape [-GrayMan-, -J-a-y-, -Maxy-, 07_Tank, 0ipopo, ...
2010sMusic [Vranak]
21savage [Uyghur1]
22lr [microphohn]
23andme [Nimushiru, Pinuzzo, Pugmas, Sav1025, TOK715, ...
240sx [I_am_a_Dan, SmackSmackk, jimmyjimmyjimmy_, pr...
24CarrotCraft [pikaras]
24hoursupport [GTALionKing, Hashi856, Moroax, SpankN, fuck_u...
...
youtubetv [ComLaw, P1_1310, kcamacho11]
yoyhammer [Emicrania, Jbugman, RoninXiC, Sprionk, jonow83]
ypp [Loxcam]
ypsi [FLoaf]
ytp [Profsano]
yugijerk [4sham, Exos_VII]
yugioh [1001puppys, 6000j, 8512332158, A_fiSHy_fish, ...
yumenikki [ripa9]
yuri [COMMENTS_ON_NSFW_PIC, MikuxLuka401, Pikushibu...
yuri_jp [Pikushibu]
yuruyuri [ACG_Yuri, KirinoNakano, OSPFv3, SarahLia]
zagreb [jocus985]
zcoin [Fugazi007]
zec [Corm, GSXP, JASH_DOADELESS_, PSYKO_Inc, infinis]
zedmains [BTZx2, EggyGG, Ryan_A121, ShacObama, Tryxi, m...
zelda [01110111011000010111, Aura64, AzaraAybara, BA...
zen [ASAMANNAMMEDNIGEL, Cranky_Kong, Dhammakayaram...
zerocarb [BigBrain007, Manga-san, vicinius]
zetime [xrnzrx]
zfs [Emachina, bqq100, fryfrog, michio_kakus_hair,...
ziftrCOIN [GT712]
zoemains [DrahaKka, OJSaucy, hahAAsuo, nysra, x3noPLEB,...
zombies [carbon107, rjksn]
zomby [jwccs46]
zootopia [BCRE8TVE, Bocaj1000, BunnyMakingAMark, Far414...
zumba [GabyArcoiris]
zyramains [Dragonasaur, Shaiaan]
zyzz [Xayv]
I am trying to iterate over every subreddit and then iterate over every subreddit beneath that to find shared commenters. The end goal is a dataframe containing subreddit 1, subreddit 2, and the number of shared commenters.
I can't even conceive of how to do this using apply, and I'm not sure how to do a double for loop with pandas DataFrames.
Is this the right idea?
for i in df2.index:
    subreddit = df2.get_value(i, 'subreddit')
    for i+1 in df2.index:
        ...
Here's an example of input and intended output:
df = pd.DataFrame({'subreddit': ['sub1', 'sub2', 'sub3', 'sub4'],
                   'user': [['A', 'B', 'C'], ['A', 'F', 'C'], ['F', 'E', 'D'], ['X', 'Y', 'Z']]})
Output for first subreddit:
subreddit_1 subreddit_2 shared_users
sub1 sub2 2
sub1 sub3 0
sub1 sub4 0
I don't know if you can get around using loops. This seems pretty similar to how you would calculate a correlation matrix, which uses loops in the pandas documentation. At least it's symmetric so you only have to compare half of them.
Instead of calculating a correlation, you want to find the number of elements shared between two lists lst1 and lst2, which is len(set(lst1) & set(lst2)).
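For example, with the first two subreddits from the sample data:
lst1 = ['A', 'B', 'C']  # sub1
lst2 = ['A', 'F', 'C']  # sub2
len(set(lst1) & set(lst2))  # 2 shared users ('A' and 'C')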
import pandas as pd
import numpy as np

df = pd.DataFrame({'subreddit': ['sub1', 'sub2', 'sub3', 'sub4'],
                   'user': [['A', 'B', 'C'], ['A', 'F', 'C'], ['F', 'E', 'D'], ['X', 'Y', 'Z']]})

mat = df.user
cols = df.subreddit
idx = cols.copy()
K = len(cols)
correl = np.empty((K, K), dtype=int)

for i, ac in enumerate(mat):
    for j, bc in enumerate(mat):
        if i > j:
            continue
        c = len(set(ac) & set(bc))
        correl[i, j] = c
        correl[j, i] = c

overlap_df = pd.DataFrame(correl, index=idx, columns=cols)
#subreddit sub1 sub2 sub3 sub4
#subreddit
#sub1 3 2 0 0
#sub2 2 3 1 0
#sub3 0 1 3 0
#sub4 0 0 0 3
And if you want to get those smaller DataFrames then you just need a little bit of manipulation. For example:
overlap_df.index.name='subreddit_1'
overlap_df[['sub1']].stack().reset_index().rename(columns={0: 'shared_users'})
subreddit_1 subreddit shared_users
0 sub1 sub1 3
1 sub2 sub1 2
2 sub3 sub1 0
3 sub4 sub1 0
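To get the long (subreddit_1, subreddit_2, shared_users) table for all pairs at once, a possible follow-up (a sketch building on overlap_df above, after its index has been renamed) is:
pairs = (overlap_df.stack()
                   .reset_index()
                   .rename(columns={'subreddit': 'subreddit_2', 0: 'shared_users'}))
pairs = pairs[pairs['subreddit_1'] != pairs['subreddit_2']]  # drop the self-comparisons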

Counting the number of specific characters ignoring duplicates: Python

I have an input like this: BFFBFBFFFBFBBBFBBBBFF.
I want to count the 'B's, ignoring consecutive duplicates, so the answer should be 6.
How can I do this in Python?
Use itertools.groupby:
>>> from itertools import groupby
>>> l = [k for k,v in groupby(s)]
>>> l
=> ['B', 'F', 'B', 'F', 'B', 'F', 'B', 'F', 'B', 'F', 'B', 'F']
>>> l.count('B')
=> 6
# driver values:
IN : s = 'BFFBFBFFFBFBBBFBBBBFF'
EDIT: Also, for more extensive use, it's better to use collections.Counter to get the count for all the characters.
>>> from collections import Counter
>>> Counter(l)
=> Counter({'B': 6, 'F': 6})
s = "BFFBFBFFFBFBBBFBBBBFF"
f = False
count = 0
for i in s:
if f and i == 'B':
continue
elif i == 'B':
count += 1
f = True
else:
f = False
print(count)
Another:
from itertools import groupby

count = 0
for i, _ in groupby(s):
    if i == 'B':
        count += 1
print(count)
You should set a counter and a flag variable, then count only occurrences which are not duplicates and flip the flag. The logic is simple: if the current letter is 'B' and the letter before isn't 'B' (dup = False), count it and flip the boolean:
s = 'BFFBFBFFFBFBBBFBBBBFF'
count = 0
dup = False
for l in s:
    if l == 'B' and not dup:
        count += 1
        dup = True
    elif l != 'B':
        dup = False
# count: 6
We can remove consecutive dups and use collections.Counter to count the B's that are left:
from collections import Counter

def remove_conseq_dups(s):
    res = ""
    for i in range(len(s) - 1):
        if s[i] != s[i+1]:
            res += s[i]
    if s:
        res += s[-1]  # keep the last character, otherwise a trailing run would be dropped
    return res

s = "BFFBFBFFFBFBBBFBBBBFF"
print(Counter(remove_conseq_dups(s))['B'])  # 6
And a groupby solution:
from itertools import groupby
s = "BFFBFBFFFBFBBBFBBBBFF"
print(sum(map(lambda x: 1 if x == 'B' else 0, [x for x, v in groupby(s)])))
Or
print(len(list(filter(lambda x: x == 'B', [x for x, v in groupby(s)]))))
Another solution, first removing the consecutive duplicates with the re library:
import re

l1 = "BFFBFBFFFBFBBBFBBBBFF"
l2 = re.sub(r'([A-Za-z])\1+', r'\1', l1)  # collapse runs of the same letter
l2.count("B")  # 6
You want to count when the letters change from F to B, and another function can do that: split. It removes all the Fs but creates empty strings for consecutive Fs, so we must exclude those from the count.
s = "BFFBFBFFFBFBBBFBBBBFF"
t = s.split('F')
n = sum([1 for b in t if len(b) > 0])
print(n)
Alternative solution:
s = 'BFFBFBFFFBFBBBFBBBBFF'
l = [c for i, c in enumerate(s) if s[i-1] != c]  # note: for i == 0 this compares against s[-1]
l.count('B')  # or use Counter
# 6

Masking a DataFrame on multiple column conditions - inside a loop

I would like to mask my dataframe conditional on multiple columns inside a loop. I am trying to do something like this:
dfs = []
val_dict = {0: 'a', 1: 'b', 2: 'c', 3: 'd'}
for i in range(4):
    items = [val_dict[i] for i in range(i+1)]
    df_ = df[(df['0'] == items[0]) & (df['1'] == items[1]) & ... ]
    dfs.append(df_)
Please note that the second condition I wrote above would not exist for the first iteration of the loop because there would be no items[1] element.
Here is a sample dataframe you are welcome to test on:
df = pd.DataFrame({'0': ['a']*3 + ['b']*3 + ['c']*3,
                   '1': ['a']*3 + ['b']*6,
                   '2': ['b']*4 + ['c']*5,
                   '3': ['c']*5 + ['d']*4})
The only solution I have come up with uses eval which I would like very much to avoid.
If you subset your DataFrame to include only the columns you want to use for comparison (as you have done in your example) and the keys in your val_dict are the same as the columns you want to compare, then you can get Pandas to do this for you.
Making a slight modification to your df:
df = pd.DataFrame({0: ['a']*3 + ['b']*3 + ['c']*3,
                   1: ['a']*3 + ['d']*6,
                   2: ['b']*4 + ['c']*5,
                   3: ['c']*5 + ['a']*4})
You can now accomplish what you want by the following
dfs = []
val_dict = {0: 'a', 1: 'b', 2: 'c', 3: 'd'}
val_series = pd.Series(val_dict)
for i in range(4):
    mask = (df == val_series).all(axis=1)
    dfs.append(df[mask])
EDIT
I am leaving my original solution even though it addresses a different problem than OP intended to solve. The intended problem can be solved by the following:
mask = True
for key in range(4):
    mask &= df[key] == val_dict[key]
    dfs.append(df[mask])
Again, this is using the modified df used earlier in my original answer.
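If you keep the original df with string column labels ('0', '1', ...), a minimal sketch of the same cumulative-mask idea (my adaptation, not from the answer above) would be:
dfs = []
mask = pd.Series(True, index=df.index)         # start from an all-True mask
for i in range(4):
    mask = mask & (df[str(i)] == val_dict[i])  # add one column's condition per iteration
    dfs.append(df[mask])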
I'll share my eval solution.
for i in range(4):
    items = [val_dict[i] for i in range(i+1)]
    df_ = eval('df[(' + ') & ('.join(['df["'+str(j)+'"] == items['+str(j)+']' for j in range(i+1)]) + ')]')
    dfs.append(df_)
It works... but so ugly :(
