Vectorize parsing json list in python / pandas - python

I am using python to get data from an API which returns a json which I need to parse for specific variables as below:
r=requests.get(request_here)
g=r.json()
N=len(g)
for i in range(0,N):
Submission_Timestamp=g[i]['vars']['submission']['timestamp']
Outcome=g[i]['appended']['suppressionlist']['add_item']['outcome']
Where each "i" is a unique record. While this method works I wanted to know if there is a way to vectorize this to grab everything in one swoop. Not all records will have the Outcome variable so I also need a KeyError exception for handling those records. Finally, I am pushing this into a pandas dataframe. How would I accomplish this? Thanks!

Related

When converting Python Object to Dataframe, output is different

I am pulling data from an api, converting it into a Python Object, and attempting to convert it to a dataframe. However, when I try to unpack Python Object, I am getting extra rows than are in the Python Object.
You can see in my dataframe how on 2023-02-03, there are multiple rows. One row seems to be giving me the correct data while the other row is giving me random data. I am not sure where the extra row is coming from. I'm wondering if it has something to do with the null values or whether I am not unpacking the Python Object correctly.
My code
I double checked the raw data from the JSON response and don't see the extra values there. On Oura's UI, I checked the raw data and didn't notice anything there either.
Here's an example of what my desired output would look like:
enter image description here
Can anyone identify what I might be doing wrong?

Python convert list of large json objects to DF

I want to convert a list of objects to a pandas dataframe. The objects are large and complex, a sample one can be found here
The output should be a DF with 3 columns: info, releases, URLs - as per the json object linked above. I've tried pd.DataFrame and from_records, but I keep getting hit with errors. Can anyone suggest a fix?
Have you tried pd.read_json() function?
Here's a link to the documentation
https://pandas.pydata.org/docs/reference/api/pandas.io.json.read_json.html

Extract pandas dataframe column names from query string

I have a dataset with a lot of fields, so I don't want to load all of it into a pd.DataFrame, but just the basic ones.
Sometimes, I would like to do some filtering upon loading and I would like to apply the filter via the query or eval methods, which means that I need a query string in the form of, i.e. "PROBABILITY > 10 and DISTANCE <= 50", but these columns need to be loaded in the dataframe.
Is is possible to extract the column names from the query string in order to load them from the dataset?
I know some magic using regex is possible, but I'm sure that it would break sooner or later, as the conditions get complicated.
So, I'm asking if there is a native pandas way to extract the column names from the query string.
I think you can use when you load your dataframe the term use cols I use it when I load a csv I dont know that is possible when you use a SQL or other format.
Columns_to use=['Column1','Column3']
pd.read_csv(use_cols=Columns_to_use,...)
Thank you

How to use parse from phonenumbers Python library on a pandas data frame?

How can I parse phone numbers from a pandas data frame, ideally using phonenumbers library?
I am trying to use a port of Google's libphonenumber library on Python,
https://pypi.org/project/phonenumbers/.
I have a data frame with 3 million phone numbers from many countries. I have a row with the phone number, and a row with the country/region code. I'm trying to use the parse function in the package. My goal is to parse each row using the corresponding country code but I can't find a way of doing it efficiently.
I tried using apply but it didn't work. I get a "(0) Missing or invalid default region." error, meaning it won't pass the country code string.
df['phone_number_clean'] = df.phone_number.apply(lambda x:
phonenumbers.parse(str(df.phone_number),str(df.region_code)))
The line below works, but doesn't get me what I want, as the numbers I have come from about 120+ different countries.
df['phone_number_clean'] = df.phone_number.apply(lambda x:
phonenumbers.parse(str(df.phone_number),"US"))
I tried doing this in a loop, but it is terribly slow. Took me more than an hour to parse 10,000 numbers, and I have about 300x that:
for i in range(n):
df3['phone_number_std'][i] =
phonenumbers.parse(str(df.phone_number[i]),str(df.region_code[i]))
Is there a method I'm missing that could run this faster? The apply function works acceptably well but I'm unable to pass the data frame element into it.
I'm still a beginner in Python, so perhaps this has an easy solution. But I would greatly appreciate your help.
Your initial solution using apply is actually pretty close - you don't say what doesn't work about it, but the syntax for a lambda function over multiple columns of a dataframe, rather than on the rows within a single column, is a bit different. Try this:
df['phone_number_clean'] = df.apply(lambda x:
phonenumbers.parse(str(x.phone_number),
str(x.region_code)),
axis='columns')
The differences:
You want to include multiple columns in your lambda function, so you want to apply your lambda function to the entire dataframe (i.e, df.apply) rather than to the Series (the single column) that is returned by doing df.phone_number.apply. (print the output of df.phone_number to the console - what is returned is all the information that your lambda function will be given).
The argument axis='columns' (or axis=1, which is equivalent, see the docs) actually slices the data frame by rows, so apply 'sees' one record at a time (ie, [index0, phonenumber0, countrycode0], [index1, phonenumber1, countrycode1]...) as opposed to slicing the other direction, which would give it ([phonenumber0, phonenumber1, phonenumber2...])
Your lambda function only knows about the placeholder x, which, in this case, is the Series [index0, phonenumber0, countrycode0], so you need to specify all the values relative to the x that it knows - i.e., x.phone_number, x.country_code.
Love the solution of #katelie, but here's my code. Added a try/except block to skip the format_number function from failing. It cannot handle strings that are too long.
import phonenumber as phon
def formatE164(self):
try:
return phon.format_number(phon.parse(str(self),"NL"),phon.PhoneNumberFormat.E164)
except:
pass
df['column'] = df['column'].apply(formatE164)

Problem when stores dict into pandas Dataframe

Recently in my project, I need to store a dictionary into a Pandas DataFrame with the code
self.device_info.loc[i,'interference']=[temp_dict]
The device_info is a Pandas DataFrame. The temp_dict is a dictionary and I want it to be stored as an element in the DataFrame for future use. The square bracket is added to ensure there's no error when assigning.
I just found it today that with Pandas version 0.22.0, this code will pack the dictionary as a list and store it into the DataFrame. However, in the version of 0.24.2, this code directly stores the dictionary into Pandas DataFrame.
For example, say when i=0, after executing the code
with Pandas.version == '0.22.0'
type(self.device_info.loc[0,'interference'])
returns list, while Pandas.version == '0.24.2', this code returns a dict. From my perspective, I need a consistent performance that there is always a dictionary stored.
I am currently working on two PCs, one's home and one's at my office, and I cannot update the older version of pandas on my office PC. So I would be much appreciated if anyone can help me figure out why this happens.
Pandas has a from_dict method with many option which takes a dict as input and returns a DataFrame.
You can chose to infer type or force it (to str, for example).
Then, manipulating and appending dataframes is way easier as you won't ever again have dict object problem in that line or column.

Categories

Resources