Insert dict into dataframe with loop - python

Fetching data from an API with a for loop, but only the last row is showing. If I put a print statement in place of d = ..., I get all records for some reason. How do I populate a DataFrame with all the values?
I tried a for loop and append but keep getting wrong results:
for x in elements:
    url = "https://my/api/v2/item/" + str(x["number"]) + "/"
    get_data = requests.get(url)
    get_data_json = get_data.json()
    d = {'id': [x["enumber"]],
         'name': [x["value1"]["name"]],
         'adress': [x["value2"]["adress"]],
         'stats': [get_data_json["stats"][5]["rating"]]
         }
    df = pd.DataFrame(data=d)
df.head()
Result:
id name order adress rating
Only the last row is showing, probably because the dataframe gets overwritten on every iteration until the last element. Should I put another for loop somewhere, or is there some obvious solution that I cannot see?

Put all your data into a list of dictionaries, then convert to a dataframe at the very end
At the top of your code write:
all_data = []
Then in your loop, after d = {...}, write:
all_data.append(d)
Finally, at the end (after the loop has finished):
df = pd.DataFrame(all_data)
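Putting it all together, a minimal sketch of the fixed loop (field names are taken from the question and may need adjusting; note that once you collect plain dicts, wrapping each value in a one-element list is no longer needed):
import requests
import pandas as pd

all_data = []
for x in elements:
    url = "https://my/api/v2/item/" + str(x["number"]) + "/"
    get_data_json = requests.get(url).json()
    # one plain dict per row; pandas builds the columns at the end
    all_data.append({
        'id': x["enumber"],
        'name': x["value1"]["name"],
        'adress': x["value2"]["adress"],
        'stats': get_data_json["stats"][5]["rating"],
    })

df = pd.DataFrame(all_data)
df.head()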


list index out of range when crawling data and adjust data

I am trying to crawl data from a list of URLs (1st loop). For each URL (2nd loop), I want to adjust product_reviews['reviews'] (a list) by adding more data. Here is my code:
import requests
import pandas as pd

df = pd.read_excel(r'C:\ids.xlsx')
ids = df['ids'].values.tolist()
link = 'https://www.real.de/product/%s/'
url_test = 'https://www.real.de/pdp-test/api/v1/%s/product-attributes/?offset=0&limit=500'
url_test1 = 'https://www.real.de/pdp-test/api/v1/%s/product-reviews/?offset=0&limit=500'
for i in ids:
    product_id = requests.get(url_test % i).json()
    product_reviews = requests.get(url_test1 % i).json()
    for x in range(0, len(product_reviews['reviews']), 1):
        product_reviews['reviews'][x]['variantAttributes'].append(str(int(100*float(product_reviews['reviews'][x]['variantAttributes'][1]['label'].replace(" m","").replace(",",".")))))
        product_reviews['reviews'][x]['variantAttributes'].append(str(int(100*float(product_reviews['reviews'][x]['variantAttributes'][0]['label'].replace(" m","").replace(",",".")))))
        product_reviews['reviews'][x]['size'] = str(int(100*float(product_reviews['reviews'][x]['variantAttributes'][1]['label'].replace(" m","").replace(",",".")))) + 'x' + str(int(100*float(product_reviews['reviews'][x]['variantAttributes'][0]['label'].replace(" m","").replace(",","."))))
        product_reviews['reviews'][x]['url'] = link % i
        product_reviews['reviews'][x]['ean'] = product_id['defaultAttributes'][0]['values'][0]['text']
        product_reviews['reviews'][x]['TotalReviewperParent'] = product_reviews['totalReviews']
    df = pd.DataFrame(product_reviews['reviews'])
    df.to_excel(r'C:\new\str(i).xlsx', index=False)
However, when I run this code, it returns an error:
line 24, in
product_reviews['reviews'][x]['variantAttributes'].append(str(int(100*float(product_reviews['reviews'][x]['variantAttributes'][1]['label'].replace(" m","").replace(",",".")))))
IndexError: list index out of range
When I run the 2nd loop for one URL, it runs fine; however, when I put the 2nd loop inside the 1st loop, it returns an error. What is the solution? Also, my code seems clumsy. Do you know how to improve it so it can be shorter?
Please, in the future, try to create a Minimal, Reproducible Example. We don't have access to your 'ids.xlsx', so we can't verify whether the problem is with a specific id in your list or a general problem.
Taking a random id, 338661983, and using the following code:
import requests

link = 'https://www.real.de/product/%s/'
url_attributes = 'https://www.real.de/pdp-test/api/v1/%s/product-attributes/?offset=0&limit=500'
url_reviews = 'https://www.real.de/pdp-test/api/v1/%s/product-reviews/?offset=0&limit=500'
ids = [338661983]
for i in ids:
    product_id = requests.get(url_attributes % i).json()
    product_reviews = requests.get(url_reviews % i).json()
    for review in product_reviews['reviews']:
        print(review)
        break
I get the following output:
{'reviewId': 1119427, 'title': 'Klasse!', 'date': '11.11.2020', 'rating': 5, 'isVerifiedPurchase': True, 'text': 'Originale Switch, schnelle Lieferung. Alles Top ', 'variantAttributes': [], 'author': 'hm-1511917085', 'datePublished': '2020-11-11T20:09:41+01:00'}
Notice that variantAttributes is an empty list.
You get an IndexError because you try to take the element at position 1 of that empty list in:
review['variantAttributes'][1]['label'].replace(" m","").replace(",",".")
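If you want the loop to survive such reviews, a sketch (assuming reviews without both variant attributes should simply be skipped) is to guard the inner loop:
for review in product_reviews['reviews']:
    if len(review['variantAttributes']) < 2:
        continue  # no size attributes on this review, skip it
    # ... the rest of the per-review processing from the question ...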

Rename a data frame name by adding the iteration value as suffix in a for loop (Python)

I have run the following Python code:
array = ['AEM000', 'AID017']
USA_DATA_1D = USA_DATA10.loc[USA_DATA10['JOBSPECIALTYCODE'].isin(array)]
I run a regression model and extract the log-likelihood value for each item of this array with a for loop:
for item in array:
    USA_DATA_1D = USA_DATA10.loc[USA_DATA10['JOBSPECIALTYCODE'] == item]
    formula = "WEIGHTED_BASE_MEDIAN_FINAL_MEAN ~ YEAR"
    response, predictors = dmatrices(formula, USA_DATA_1D, return_type='dataframe')
    mod1 = sm.GLM(response, predictors, family=sm.genmod.families.family.Gaussian()).fit()
    LLF_NG = {'model': ['Standard Gaussian'],
              'llf_value': mod1.llf
              }
    df_llf = pd.DataFrame(LLF_NG, columns=['model', 'llf_value'])
Now I would like to rename the dataframe df_llf to df_llf_(name of the item), i.e. df_llf_AEM000 when running the loop on the first item and df_llf_AID017 on the second.
I need some help with how to proceed.
If you want to rename the data frame, you need to use the copy method so that the original data frame does not get altered.
df_llf_AEM000 = df_llf.copy()
If you want to save iteratively several different versions of the original data frame, you can do something like this:
allDataframes = []
for i in range(10):
    df = df_original.copy()
    allDataframes.append(df)
print(allDataframes[0])
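That said, dynamically named variables are awkward to work with in Python. A common alternative (a sketch reusing the loop from the question) is to store each result in a dict keyed by the item, so it can be looked up by name later:
df_llf_by_item = {}
for item in array:
    # ... run the regression and build LLF_NG as in the question ...
    df_llf_by_item['df_llf_' + item] = pd.DataFrame(LLF_NG, columns=['model', 'llf_value'])

print(df_llf_by_item['df_llf_AEM000'])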

Function not looping through all the values and stopping after first value in the loop

I am new to Python, so sorry if I'm not clear. I am trying to create a loop to get the moving average of costs for different products. I have a data set with over 55,000 products (my_product_id) and their cost for every month since 2019-01-01. I am trying to write a function that computes the moving average of the costs over the most recent 3 months. So far I have written the function below; it works, but it only runs for one product and stops the loop after that. I need to be able to run it over all unique product IDs in the column.
def abc(df_189):
    dfObj = pd.DataFrame(columns=['my_product_id', 'Cost'])
    my_products = df_189.my_product_id.unique()
    for i in my_products:
        df_test = df_189[df_189.my_product_id == i]
        Grouped = df_test.groupby('date')
        GetWeightAvg = lambda g: np.average(g['cost'], weights=g['quantity'])
        pr = Grouped.apply(GetWeightAvg).sort_index(ascending=False).head(3).mean()
        dfObj = dfObj.append({'my_product_id': i, 'Cost': pr}, ignore_index=True)
        return dfObj
This returns a DataFrame with just one row, for the first product ID, so the logic is working but it stops after the first product.
Thanks in advance :)
Try putting the return out of the loop like this:
def abc(df_189):
    dfObj = pd.DataFrame(columns=['my_product_id', 'Cost'])
    my_products = df_189.my_product_id.unique()
    for i in my_products:
        df_test = df_189[df_189.my_product_id == i]
        Grouped = df_test.groupby('date')
        GetWeightAvg = lambda g: np.average(g['cost'], weights=g['quantity'])
        pr = Grouped.apply(GetWeightAvg).sort_index(ascending=False).head(3).mean()
        dfObj = dfObj.append({'my_product_id': i, 'Cost': pr}, ignore_index=True)
    return dfObj
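As a side note, DataFrame.append is deprecated in recent pandas versions (and removed in 2.0), so here is a sketch of the same fix that collects plain dicts in a list and builds the frame once at the end (same column and field names as the question):
def abc(df_189):
    rows = []
    for i in df_189.my_product_id.unique():
        df_test = df_189[df_189.my_product_id == i]
        # weighted average cost per date, then average the 3 most recent dates
        pr = (df_test.groupby('date')
                     .apply(lambda g: np.average(g['cost'], weights=g['quantity']))
                     .sort_index(ascending=False)
                     .head(3)
                     .mean())
        rows.append({'my_product_id': i, 'Cost': pr})
    return pd.DataFrame(rows)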

iterate over list of dicts to create different strings

I have a pandas file with 3 different columns that I turn into a dictionary with to_dict; the result is a list of dictionaries:
df = [
    {'HEADER1': 'col1-row1', 'HEADER2': 'col2-row1', 'HEADER3': 'col3-row1'},
    {'HEADER1': 'col1-row2', 'HEADER2': 'col2-row2', 'HEADER3': 'col3-row2'}
]
Now my problem is that I need the values of 'col2-rowX' and 'col3-rowX' to build a URL and use requests and bs4 to scrape the websites.
I need my result to be something like the following:
requests.get('http://www.website.com/' + row1-col2 + 'another-string' + row1-col3 + 'another-string')
And I need to do that for every dictionary in the list.
I have tried iterating over the dictionaries with for loops, something like:
import pandas as pd
import os

os.chdir('C://Users/myuser/Desktop')
df = pd.DataFrame.from_csv('C://Users/myuser/Downloads/export.csv')
# Remove 'Code' column
df = df.drop('Code', axis=1)
# Remove 'Code2' as index
df = df.reset_index()
# Rename columns for easier manipulation
df.columns = ['CB', 'FC', 'PO']
# Convert to dictionary for easy URL iteration and creation
df = df.to_dict('records')
for row in df:
    for key in row:
        print(key)
You only ever iterate twice, and you short-circuit out of the nested for loop every time it executes by returning there. Looking up the necessary information in each dictionary lets you build up your URLs. One possible example:
def get_urls(l_d):
    l = []
    for d in l_d:
        l.append('http://www.website.com/' + d['HEADER2'] + 'another-string' + d['HEADER3'] + 'another-string')
    return l

df = [{'HEADER1': 'col1-row1', 'HEADER2': 'col2-row1', 'HEADER3': 'col3-row1'},
      {'HEADER1': 'col1-row2', 'HEADER2': 'col2-row2', 'HEADER3': 'col3-row2'}]
print(get_urls(df))
>>> ['http://www.website.com/col2-row1another-stringcol3-row1another-string', 'http://www.website.com/col2-row2another-stringcol3-row2another-string']
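From there, a sketch of the scraping step the question mentions (website.com and the 'another-string' pieces are placeholders from the question):
import requests
from bs4 import BeautifulSoup

for url in get_urls(df):
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')  # hand the page to bs4 from here
    print(url, resp.status_code)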

Optimize parsing file with JSON objects in pandas dataframe, where keys may be missing in some rows

I'm looking to optimize the code below, which takes ~5 seconds; that is too slow for a file of only 1000 lines.
I have a large file where each line contains valid JSON, with each JSON looking like the following (the actual data is much larger and nested, so I use this JSON snippet for illustration):
{"location":{"town":"Rome","groupe":"Advanced",
"school":{"SchoolGroupe":"TrowMet", "SchoolName":"VeronM"}},
"id":"145",
"Mother":{"MotherName":"Helen","MotherAge":"46"},"NGlobalNote":2,
"Father":{"FatherName":"Peter","FatherAge":"51"},
"Teacher":["MrCrock","MrDaniel"],"Field":"Marketing",
"season":["summer","spring"]}
I need to parse this file in order to extract only some key-values from every JSON, to obtain the resulting dataframe:
Groupe Id MotherName FatherName
Advanced 56 Laure James
Middle 11 Ann Nicolas
Advanced 6 Helen Franc
But some of the keys I need in the dataframe are missing from some JSON objects, so I have to verify that a key is present and, if not, fill the corresponding value with NaN. I use the following method:
df = pd.DataFrame(columns=['groupe', 'id', 'MotherName', 'FatherName'])
with open(path/to/file) as f:
    for chunk in f:
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName},
                       ignore_index=True)
I need to optimize the runtime over the whole 1000-row file to <= 2 seconds. In Perl the same parsing takes < 1 second, but I need to implement it in Python.
You'll get the best performance if you can build the dataframe in a single step during initialization. DataFrame.from_records takes a sequence of tuples, which you can supply from a generator that reads one record at a time. You can parse the data faster with get, which supplies a default value when the item isn't found. I created an empty dict called dummy to pass into the intermediate gets so that a chained get always works.
I created a 1000-record dataset, and on my crappy laptop the time went from 18 seconds to 0.06 seconds. That's pretty good.
import numpy as np
import pandas as pd
import json
import time

def extract_data(data):
    """Convert one JSON line to a record tuple for import."""
    dummy = {}
    jfile = json.loads(data.strip())
    return (
        jfile.get('location', dummy).get('groupe', np.nan),
        jfile.get('id', np.nan),
        jfile.get('Mother', dummy).get('MotherName', np.nan),
        jfile.get('Father', dummy).get('FatherName', np.nan))

start = time.time()
df = pd.DataFrame.from_records(map(extract_data, open('file.json')),
                               columns=['groupe', 'id', 'MotherName', 'FatherName'])
print('New algorithm', time.time() - start)
#
# The original way
#
start = time.time()
df = pd.DataFrame(columns=['groupe', 'id', 'MotherName', 'FatherName'])
with open('file.json') as f:
    for chunk in f:
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName},
                       ignore_index=True)
print('original', time.time() - start)
The key is not to append each row to the dataframe inside the loop. Keep the rows in a list or dict container and build the dataframe once at the end. You can also simplify your if/else structure with a simple get that returns a default value (e.g. np.nan) if the key is not found in the dictionary.
with open(path/to/file) as f:
    d = {'groupe': [], 'id': [], 'MotherName': [], 'FatherName': []}
    for chunk in f:
        jfile = json.loads(chunk)
        d['groupe'].append(jfile['location'].get('groupe', np.nan))
        d['id'].append(jfile.get('id', np.nan))
        d['MotherName'].append(jfile['Mother'].get('MotherName', np.nan))
        d['FatherName'].append(jfile['Father'].get('FatherName', np.nan))
df = pd.DataFrame(d)
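One caveat: jfile['Mother'] will still raise a KeyError if the whole Mother object is missing from a line. The chained-get trick from the first answer covers that case too, e.g.:
dummy = {}
d['MotherName'].append(jfile.get('Mother', dummy).get('MotherName', np.nan))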
