How can I solve this empty DataFrame problem? - python

I'm trying to save some API responses in a DataFrame, but after running the code it is still empty. What am I missing? Could someone help me?
import requests
import json
import pandas as pd
from pandas import json_normalize

lista = []
lista = pd.DataFrame(lista)
contador = 0
data_inicial = 'https://public2-api2.ploomes.com/Deals'

while contador <= 3:
    data_text = requests.get(data_inicial, headers={'User-Key': ...}).text  # key removed from the post
    json_object = json.loads(data_text)
    json_formatted_str = json.dumps(json_object, indent=2)
    data_teste = requests.get(data_inicial, headers={'User-Key': ...})  # key removed from the post
    dictr = data_teste.json()
    recs = dictr['value']
    code = json_normalize(recs)
    lista.append(code)
    next_link = dictr['#odata.nextLink']
    data_inicial = next_link
    contador = contador + 1
    continue
I removed the header key to avoid problems. But I think it's something in the syntax.

When you call lista.append(code), you are appending to a DataFrame, not a list, because of the earlier lines:
lista = []
lista = pd.DataFrame(lista)
DataFrame appends are not in place: the call returns a new DataFrame, which you never assign anywhere (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html). DataFrame.append is also deprecated. Keep lista as a plain list by omitting the second assignment, append to it (list.append is in place), then build the final DataFrame with pd.concat.
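A minimal sketch of that fix, assuming the same endpoint and pagination fields as in the question (the 'User-Key' value here is a placeholder):
import requests
import pandas as pd
from pandas import json_normalize

frames = []                      # plain Python list; list.append is in place
url = 'https://public2-api2.ploomes.com/Deals'
contador = 0

while contador <= 3:
    resp = requests.get(url, headers={'User-Key': 'YOUR_KEY_HERE'})  # placeholder key
    dictr = resp.json()
    frames.append(json_normalize(dictr['value']))  # collect one page per iteration
    url = dictr['#odata.nextLink']                 # follow the OData pagination link
    contador += 1

lista = pd.concat(frames, ignore_index=True)       # build the DataFrame once at the end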

Related

Cannot save modifications made in xlsx file

I read an .xlsx file and update it, but I'm not able to save the changes.
from xml.dom import minidom as md
[... some code ....]
sheet = workDir + '/xl/worksheets/sheet'
sheet1 = sheet + '1.xml'
importSheet1 = open(sheet1,'r')
whole_file = importSheet1.read()
data_Sheet = md.parseString(whole_file)
[... some code ....]
self.array_mem_name = []
y = 1
x = 5 #first useful row
day = int(day)
found = 0
while x <= len_array_shared:
    readrow = data_Sheet.getElementsByTagName('row')[x]
    c_data = readrow.getElementsByTagName('c')[0]
    c_attrib = c_data.getAttribute('t')
    if c_attrib == 's':
        vName = c_data.getElementsByTagName('v')[0].firstChild.nodeValue
        #if int(vName) != broken:
        mem_name = self.array_shared[int(vName)]
        if mem_name != '-----':
            if mem_name == old:
                c_data = readrow.getElementsByTagName('c')[day]
                c_attrib = c_data.getAttribute('t')
                if c_attrib == 's':
                    v_Attrib = c_data.getElementsByTagName('v')[0].firstChild.nodeValue
                    if v_Attrib != '':
                        #loc = self.array_shared[int(v_Attrib)]
                        index = self.array_shared.index('--')
                        c_data.getElementsByTagName('v')[0].firstChild.nodeValue = index

with open(sheet1, 'w') as f:
    f.write(whole_file)
As you can see, I use f.write(whole_file), but whole_file does not contain the change made with index (the edit happens on the parsed DOM in data_Sheet, not on the original string).
Checking in the debugger I see that the new value has been added to the node, but I can't save sheet1 with the modified value.
I switched to using openpyxl instead, as Lei Yang suggested in a comment. I found that this tool worked better for my job: with openpyxl, reading cell values is much easier than with xml.dom.minidom.
My only concern is that openpyxl seems noticeably slower than the DOM approach at loading the workbook; maybe memory was the bottleneck. But I was more interested in using something simpler than in that minor performance issue.
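For reference, a minimal sketch of the openpyxl approach (the file path, sheet name, and cell coordinates are placeholders, not taken from the question):
from openpyxl import load_workbook

wb = load_workbook('example.xlsx')         # placeholder path
ws = wb['Sheet1']                          # placeholder sheet name

value = ws.cell(row=5, column=1).value     # read a cell
ws.cell(row=5, column=2).value = 'index'   # write a cell
wb.save('example.xlsx')                    # changes are only persisted by save()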

Fastest way to iterate function over pandas dataframe

I have a function which operates over lines of a csv file, adding values of different cells to dictionaries depending on whether conditions are met:
df = pd.concat([pd.read_csv(filename) for filename in args.csv], ignore_index=True)

ID_Use_Totals = {}
ID_Order_Dates = {}
ID_Received_Dates = {}
ID_Refs = {}
IDs = args.ID

def TSQs(row):
    global ID_Use_Totals, ID_Order_Dates, ID_Received_Dates
    if row['Stock Item'] not in IDs:
        pass
    else:
        if row['Action'] in ['Order/Resupply', 'Cons. Purchase']:
            if row['Stock Item'] not in ID_Order_Dates:
                ID_Order_Dates[row['Stock Item']] = [{row['Ref']: pd.to_datetime(row['TransDate'])}]
            else:
                ID_Order_Dates[row['Stock Item']].append({row['Ref']: pd.to_datetime(row['TransDate'])})
        elif row['Action'] == 'Received':
            if row['Stock Item'] not in ID_Received_Dates:
                ID_Received_Dates[row['Stock Item']] = [{row['Ref']: pd.to_datetime(row['TransDate'])}]
            else:
                ID_Received_Dates[row['Stock Item']].append({row['Ref']: pd.to_datetime(row['TransDate'])})
        elif row['Action'] == 'Use':
            if row['Stock Item'] in ID_Use_Totals:
                ID_Use_Totals[row['Stock Item']].append(row['Qty'])
            else:
                ID_Use_Totals[row['Stock Item']] = [row['Qty']]
        else:
            pass
Currently, I am doing:
for index, row in df.iterrows():
    TSQs(row)
But timer() returns between 70 and 90 seconds for a 40,000 line csv file.
I want to know what the fastest way of implementing this is over the entire dataframe (which could potentially be hundreds of thousands of rows).
I'd wager not using Pandas for this could be faster.
Additionally, you can use defaultdicts to avoid having to check whether you've seen a given product yet:
import csv
import collections
import datetime
ID_Use_Totals = collections.defaultdict(list)
ID_Order_Dates = collections.defaultdict(list)
ID_Received_Dates = collections.defaultdict(list)
ID_Refs = {}
IDs = set(args.ID)
order_actions = {"Order/Resupply", "Cons. Purchase"}
for filename in args.csv:
with open(filename) as f:
for row in csv.DictReader(f):
item = row["Stock Item"]
if item not in IDs:
continue
ref = row["Ref"]
action = row["Action"]
if action in order_actions:
date = datetime.datetime.fromisoformat(row["TransDate"])
ID_Order_Dates[item].append({ref: date})
elif action == "Received":
date = datetime.datetime.fromisoformat(row["TransDate"])
ID_Received_Dates[item].append({ref: date})
elif action == "Use":
ID_Use_Totals[item].append(row["Qty"])
EDIT: If the CSV is really of the form
"Employee", "Stock Location", "Stock Item"
"Ordered", "16", "32142"
the standard-library csv module can't quite parse it as-is.
You could use Pandas to parse the file, then iterate over rows, though I'm not sure if this'll end up being much faster in the end:
import collections
import datetime
import pandas as pd

ID_Use_Totals = collections.defaultdict(list)
ID_Order_Dates = collections.defaultdict(list)
ID_Received_Dates = collections.defaultdict(list)
ID_Refs = {}
IDs = set(args.ID)
order_actions = {"Order/Resupply", "Cons. Purchase"}

for filename in args.csv:
    for index, row in pd.read_csv(filename).iterrows():
        item = row["Stock Item"]
        if item not in IDs:
            continue
        ref = row["Ref"]
        action = row["Action"]
        if action in order_actions:
            date = datetime.datetime.fromisoformat(row["TransDate"])
            ID_Order_Dates[item].append({ref: date})
        elif action == "Received":
            date = datetime.datetime.fromisoformat(row["TransDate"])
            ID_Received_Dates[item].append({ref: date})
        elif action == "Use":
            ID_Use_Totals[item].append(row["Qty"])
You can use the apply function. The code will look like this:
df.apply(TSQs, axis=1)
Here, with axis=1, each row is passed to the function TSQs as a pd.Series, so you can index it like row["Ref"] to get the value for that line. Since this is a vectorized operation, it will run much faster than a for loop.
Probably fastest not to iterate at all:
# Build some boolean indices for your various conditions
idx_stock_item = df["Stock Item"].isin(IDs)
idx_purchases = df["Action"].isin(['Order/Resupply', 'Cons. Purchase'])
idx_order_dates = df["Stock Item"].isin(ID_Order_Dates)
# combine the indices to act on specific rows all at once
idx_combined = idx_stock_item & idx_purchases & ~idx_order_dates
# It looks like you were putting a single-entry dictionary in each row -
# wouldn't it make sense to just use two columns instead?
# i.e. take advantage of the DataFrame data structure
ID_Order_Dates.loc[df.loc[idx_combined, "Stock Item"], "Ref"] = df.loc[idx_combined, "Ref"]
ID_Order_Dates.loc[df.loc[idx_combined, "Stock Item"], "Date"] = df.loc[idx_combined, "TransDate"]
# repeat for your other cases
# ...
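Note that this last snippet treats ID_Order_Dates as a DataFrame indexed by stock item rather than the dict used in the question; a minimal sketch of that setup, with assumed column names, might be:
import pandas as pd

# Hypothetical setup: one row per stock item of interest, with "Ref" and "Date" columns.
ID_Order_Dates = pd.DataFrame(index=pd.Index(sorted(IDs), name="Stock Item"),
                              columns=["Ref", "Date"])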

How to create variables to loop over files and merge into a dataframe?

I want to create a DataFrame with data for tennis matches of a specific player, 'Lenny Hampel'. For this I downloaded a lot of .json files with data for his matches - all in all there are around 100 files. As they are JSON files, I need to convert every single file into a dict to get it into the DataFrame in the end, and finally concatenate each file into the DataFrame. I could hard-code it, but that seems kind of silly, and I could not find a proper way to iterate through this.
Could you help me understand how I could create a loop or something else in order to code this the smart way?
from bs4 import BeautifulSoup
import requests
import json
import bs4 as bs
import urllib.request
from urllib.request import Request, urlopen
import pandas as pd
import pprint

with open('lenny/2016/lenny2016_match (1).json') as json_file:
    lennymatch1 = json.load(json_file)

player = [item
          for item in lennymatch1["stats"]
          if item["player_fullname"] == "Lenny Hampel"]

with open('lenny/2016/lenny2016_match (2).json') as json_file:
    lennymatch2 = json.load(json_file)

player2 = [item
           for item in lennymatch2["stats"]
           if item["player_fullname"] == "Lenny Hampel"]

with open('lenny/2016/lenny2016_match (3).json') as json_file:
    lennymatch3 = json.load(json_file)

player33 = [item
            for item in lennymatch3["stats"]
            if item["player_fullname"] == "Lenny Hampel"]

with open('lenny/2016/lenny2016_match (4).json') as json_file:
    lennymatch4 = json.load(json_file)

player4 = [item
           for item in lennymatch4["stats"]
           if item["player_fullname"] == "Lenny Hampel"]

tabelle1 = pd.DataFrame.from_dict(player)
tabelle2 = pd.DataFrame.from_dict(player2)
tabelle3 = pd.DataFrame.from_dict(player33)
tabelle4 = pd.DataFrame.from_dict(player4)

tennisstats = [tabelle1, tabelle2, tabelle3, tabelle4]
result = pd.concat(tennisstats)
result
This is fairly basic looping: do the repeated work inside a loop, collecting the per-file results in a list, and concatenate once afterwards.
# --- before loop ---

tennisstats = []

# --- loop ---

for filename in ["lenny/2016/lenny2016_match (1).json", "lenny/2016/lenny2016_match (2).json"]:

    with open(filename) as json_file:
        lennymatch = json.load(json_file)

    player = [item
              for item in lennymatch["stats"]
              if item["player_fullname"] == "Lenny Hampel"]

    tabele = pd.DataFrame.from_dict(player)
    tennisstats.append(tabele)

# --- after loop ---

result = pd.concat(tennisstats)
If the filenames are similar and differ only by a number:
for number in range(1, 101):
    filename = f"lenny/2016/lenny2016_match ({number}).json"

    with open(filename) as json_file:
and the rest is the same as in the first version.
If all the files are in the same folder, then maybe you should use os.listdir():
import os

directory = "lenny/2016/"

for name in os.listdir(directory):
    filename = directory + name

    with open(filename) as json_file:
and the rest is the same as in the first version.
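Putting those pieces together, a complete sketch (assuming all ~100 files live in lenny/2016/ and have the same structure as in the question):
import os
import json
import pandas as pd

directory = "lenny/2016/"
tennisstats = []

for name in sorted(os.listdir(directory)):
    if not name.endswith(".json"):
        continue                                   # skip anything that isn't a match file
    with open(os.path.join(directory, name)) as json_file:
        match = json.load(json_file)
    player = [item for item in match["stats"]
              if item["player_fullname"] == "Lenny Hampel"]
    tennisstats.append(pd.DataFrame.from_dict(player))

result = pd.concat(tennisstats, ignore_index=True)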

"Remove" in else clause changes results of loop over json dict

I am iterating over a dict created from a JSON file, which works fine, but as soon as I remove some of the entries in the else clause the results change: normally it prints 35 nuts_ids, but with the remove in the else only 32 are printed. So it seems that the remove influences the iteration, but why? The key should be safe, shouldn't it? How can I do this properly without losing data?
import json

with open("test.json") as json_file:
    json_data = json.load(json_file)

for g in json_data["features"]:
    poly = g["geometry"]
    cntr_code = g["properties"]["CNTR_CODE"]
    nuts_id = g["properties"]["NUTS_ID"]
    name = g["properties"]["NUTS_NAME"]
    if cntr_code == "AT":
        print(nuts_id)
        # do plotting etc
    else:  # delete it if it is not part of a specific country
        json_data["features"].remove(g)  # line in question

# do something else with the json_data
It's not good practice to delete items while iterating over the object. Instead, you can filter down to just the elements you need.
Ex:
import json

with open("test.json") as json_file:
    json_data = json.load(json_file)

# Filter out other country codes.
json_data_features = [g for g in json_data["features"] if g["properties"]["CNTR_CODE"] == "AT"]
json_data["features"] = json_data_features

for g in json_data["features"]:
    poly = g["geometry"]
    cntr_code = g["properties"]["CNTR_CODE"]
    nuts_id = g["properties"]["NUTS_ID"]
    name = g["properties"]["NUTS_NAME"]
    # do plotting etc

# do something else with the json_data
Always remember the cardinal rule: never modify an object you are iterating over.
You can take a copy of the list (e.g. with .copy() or copy.copy) and iterate over the copy instead:
import json
import copy

with open("test.json") as json_file:
    json_data = json.load(json_file)

# Take a copy of json_data['features']
json_data_copy = json_data['features'].copy()

# Iterate over the copy
for g in json_data_copy:
    poly = g["geometry"]
    cntr_code = g["properties"]["CNTR_CODE"]
    nuts_id = g["properties"]["NUTS_ID"]
    name = g["properties"]["NUTS_NAME"]
    if cntr_code == "AT":
        print(nuts_id)
        # do plotting etc
    else:  # delete it if it is not part of a specific country
        json_data["features"].remove(g)  # line in question

Pandas error: can only use .str accessor with string values, which use np.object dtype

My input file is a JSON file:
{
    "infile": "c:/tmp/cust-in-sample.xlsx",
    "SheetName": "Sheet1",
    "CleanColumns": [1, 2],
    "DeleteColumns": [3, 5],
    "outfile": "c:/tmp/out-cust-in-sample.csv"
}
I would like the columns specified in the JSON to be cleaned and deleted, but I'm getting the pandas .str accessor error.
I'm currently trying this code:
import json
import pandas as pd
import gzip
import shutil
import sys

zJsonFile = sys.argv[-1]
iCount = len(sys.argv)

if iCount == 2:
    print "json file path ", zJsonFile
else:
    print "need a json file path ending the script"
    sys.exit()

with open(zJsonFile, 'rb') as zTestJson:
    decoded = json.load(zTestJson)

# Parameterizing the code: reading each key from the 'decoded' variable and
# putting it into another variable for the purpose of parameterizing
Infile = decoded.get('infile')
print Infile

Outfile = decoded.get('outfile')
print Outfile

Sheetname = decoded.get('SheetName')
print Sheetname

# this is a list
deletecols = decoded.get('DeleteColumns')
print deletecols

# this is a list
cleancols = decoded.get('CleanColumns')
print cleancols

input_sheet = pd.ExcelFile(Infile)

dfs = {}
for x in [Sheetname]:
    dfs[x] = input_sheet.parse(x)
    print dfs

df = pd.DataFrame(dfs[x])  # Converting dict to dataframe
print df

deletecols = df.columns.values.tolist()
cleancols = df.columns.values.tolist()

for idx, item in enumerate(deletecols):
    df.pop(item)
    #df.drop(df.columns[deletecols], axis=1, inplace=True)

# Cleaning the code
#cleancols=[]
for x in cleancols:
    df[x] = df[x].str.replace(to_replace='"', value='', regex=True)
    df[x] = df[x].str.replace(to_replace="'", value='', regex=True)
    df[x] = df[x].str.replace(to_replace=",", value='', regex=True)
I tried df.pop and df.drop, but neither seems to be working for me, and looping through cleancols isn't cleaning my file either.
Any help is highly appreciated!
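For what it's worth, a minimal sketch of one way the JSON-driven clean/delete could be wired up (this assumes CleanColumns and DeleteColumns hold 1-based column positions, which the question does not state explicitly, and uses the positional pat/repl form of Series.str.replace; 'config.json' is a placeholder path):
import json
import pandas as pd

with open('config.json') as fh:
    cfg = json.load(fh)

df = pd.read_excel(cfg['infile'], sheet_name=cfg['SheetName'])

# Resolve the (assumed 1-based) positions to column names before modifying the frame.
clean_names = [df.columns[i - 1] for i in cfg['CleanColumns']]
delete_names = [df.columns[i - 1] for i in cfg['DeleteColumns']]

# Strip quotes and commas from the columns to be cleaned.
for col in clean_names:
    df[col] = (df[col].astype(str)
                      .str.replace('"', '', regex=False)
                      .str.replace("'", '', regex=False)
                      .str.replace(',', '', regex=False))

# Drop the unwanted columns and write the result.
df = df.drop(columns=delete_names)
df.to_csv(cfg['outfile'], index=False)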
