Flattened JSON Repeating Columns Pandas - python

I'm having an issue currently with pulling from an API, getting a JSON dict, then flattening it, and placing it into a dataframe.
The data is structured like this from the json:
X1_0, X2_0, X3_0, ..., X1_1, X2_1, X3_1, ..., X1_2, X2_2, X3_2, ...
When I flatten it and place it into a dataframe, each flattened key becomes its own column header rather than being combined, because of the _# suffixes.
So rather than getting something whose shape is 22 x 6, I get something like 1 x 130.
I'm basically just interested in getting the shape of the dataframe right, but I'm not sure how to fix it, or whether that should be done before or after flattening.
Any help is appreciated

Try to strip the _# from the keys. My guess is that they are added in the flattening step, so it should be easy to get rid of them.
Now you will get several values per key. Fix this by building a list of JSON objects, one per _#, where each object holds the values of all the keys that shared that suffix.
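A minimal sketch of that regrouping, assuming the flattened result is a single dict whose keys look like name_index (the sample values here are hypothetical):

import pandas as pd

# hypothetical flattened dict: keys like "X1_0", "X2_0", ..., "X3_21"
flat = {"X1_0": "a", "X2_0": "b", "X3_0": "c",
        "X1_1": "d", "X2_1": "e", "X3_1": "f"}

rows = {}
for key, value in flat.items():
    name, _, index = key.rpartition("_")   # split "X1_0" into "X1" and "0"
    rows.setdefault(int(index), {})[name] = value

# one row per _# group, so the shape comes out as rows x columns
df = pd.DataFrame([rows[i] for i in sorted(rows)])
print(df.shape)   # (2, 3) here; (22, 6) for the data in the question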

Related

Pandas store column value queried by index as a list

Hi, I am trying to store the values I printed by querying the data frame by index. I tried using .tolist and list, but I cannot store them. Is there any way I can store these values and put the two query results together in one data frame?
Thank you in advance!
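The question doesn't show the exact queries, but as a rough sketch, assuming the values come from row lookups by index (the df and index values here are hypothetical), storing and combining could look like this:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# .tolist() returns a plain Python list; assign it to a name to store it
first_query = df.loc[0].tolist()    # values of the row at index 0
second_query = df.loc[2].tolist()   # values of the row at index 2

# put the two query results together in one data frame
combined = pd.DataFrame([first_query, second_query], columns=df.columns)
print(combined)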

Extracting data from a dict like column in a dataframe

I have column info in a dataframe with data in dict like format like the one below.
I would like to get another dataframe with this info, and I tried:
feature = [d.get('Feature') for d in df['info']]
but it returns None.
How can I do it? I am really having a bad time trying to get this done.
As the dict is nested, you can try pd.json_normalize() which normalizes semi-structured JSON data into a flat table:
df_new = pd.json_normalize(df['info'])
As some of the inner dicts are nested under a list, you may need further handling to dig out the deeper contents. Anyway, this should serve well as a starting point for your work.
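A small sketch of what that further handling might look like; the records and the tags list here are hypothetical:

import pandas as pd

records = [
    {"Feature": "A", "details": {"score": 1},
     "tags": [{"name": "x"}, {"name": "y"}]},
]

# nested dicts are flattened into dotted column names like 'details.score'
df_new = pd.json_normalize(records)

# inner lists need record_path/meta to dig one level deeper
df_tags = pd.json_normalize(records, record_path="tags", meta=["Feature"])
print(df_tags)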

How to populate DataFrame column from JSON string element in another column

I have a DataFrame with a "details" column that I believe is a dictionary. The initial data is a JSON string parsed with json.loads, then converted from a dictionary to a DataFrame. I would like to populate a new "affectedDealId" column with the value in data['details']['actions']['affectedDealId'].
I'm hoping I can do this the "DataFrame" way without using a loop with something like:
data['affectedDealId'] = data['details'].get('actions').get('affectedDealId')
To simplify I've tried:
data['actions'] = data['details'].get('actions')
But that ends up as "None".
Also, data['details'] seems to be a Series, though I think it was a dictionary before the conversion to a DataFrame.
Alternatively, I do later loop through the DataFrame. How would I access that 'affectedDealId' element?
Below is a screenshot of the DataFrame from the PyCharm debugger.
I'm making some assumptions about the details JSON, but does this help? You will have to adjust the json.loads(x) key/index to extract the right location.
import json
df['affectedDealId'] = df['details'].apply(lambda x: json.loads(x)['affectedDealId'])
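For the nested path described in the question (details -> actions -> affectedDealId), the adjusted version might look like one of these, depending on whether the column holds raw JSON strings or already-parsed dicts:

# if 'details' holds JSON strings, parse first, then walk the nested keys
df['affectedDealId'] = df['details'].apply(lambda x: json.loads(x)['actions']['affectedDealId'])

# if 'details' already holds dicts (as the question suggests), index directly
df['affectedDealId'] = df['details'].apply(lambda x: x['actions']['affectedDealId'])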
I think it will be great if you could do something like this.
This creates a data frame from your JSON column by calling pd.Series:
data_frame_new = df['details'].apply(pd.Series)
Then reassign your data frame by concatenating data_frame_new with the existing one:
df = pd.concat([df, data_frame_new], axis=1)
print(df)
This approach worked for me on a recent project.
Your affectedDealId will become a column of its own, with the data populated.
I hope it is of help to you.
Thanks

How to get only the errors from the insert_rows_from_dataframe method in the BigQuery Client?

I am using client.insert_rows_from_dataframe method to insert data into my table.
obj = client.insert_rows_from_dataframe(table=TableRef, dataframe=df)
If there are no errors, obj will be a list of empty lists, like
> print(obj)
[[], [], []]
But if some rows fail while inserting, how do I get the error messages out?
I tried
obj[["errors"]] ?
but that is not correct. Please help.
To achieve the results that you want, you must give your DataFrame a header identical to the one in your schema. For example, if your schema in BigQuery has the fields index and name, your DataFrame should have these two columns.
Let's take a look at the example below:
I created a table in BigQuery named insert_from_dataframe, which contains the fields index, name and number, respectively INTEGER, STRING and INTEGER, all of them REQUIRED.
The first screenshot showed that the insertion raised no errors, and the second that the data was inserted successfully.
After that, I removed the value of number from the last row of the same data. When I tried to push it to BigQuery, I got an error.
Given that, I would like to reinforce two points:
The error structure that is returned is a list of lists ([[], [], [], ...]). The reason is that your data is pushed in chunks (subsets of your data). In the function used, you can specify how many rows each chunk has with the parameter chunk_size=<number_of_rows>. Let's suppose that your data has 1600 rows and your chunk size is 500: the data will be divided into 4 chunks, and the object returned after the insert request will consist of 4 lists inside a list, each one related to one chunk. It's also important to say that if a row fails, none of the rows inside the same chunk will be inserted into the table (see the sketch after these points).
If you are using string fields, you should pay attention to the data inserted. Sometimes Pandas reads null values as empty strings, which leads the insertion mechanism to misinterpret the data. In other words, it's possible to end up with empty strings inserted in your table where the expected result would be an error saying that the field cannot be null.
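Putting point 1 into code, reading the messages out is just a matter of walking the nested lists. A sketch, assuming each entry is a per-row error mapping like the ones insert_rows returns:

errors = client.insert_rows_from_dataframe(table=TableRef, dataframe=df)

# one inner list per chunk; each entry describes one failed row
for chunk_index, chunk_errors in enumerate(errors):
    for row_error in chunk_errors:
        print(f"chunk {chunk_index}: {row_error}")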
Finally, I would like to post here some useful links for this problem:
BigQuery client documentation
Working with missing values in Pandas
I hope it helps.

Mutable indexed heterogeneous data structure?

Is there a data class or type in Python that matches these criteria?
I am trying to build an object that looks something like this:
ExperimentData
ID 1
sample_info_1: character string
sample_info_2: character string
Dataframe_1: pandas data frame
Dataframe_2: pandas data frame
ID 2
(etc.)
Right now, I am using a dict to hold the object ('ExperimentData'), which contains a namedtuple for each ID. Each namedtuple has a named field for the corresponding data attached to the sample. This lets me keep all the IDs indexed, and all of the fields under each ID indexed as well.
However, I need to update and/or replace the entries under each ID during downstream analysis. Since a tuple is immutable, this does not seem to be possible.
Is there a better implementation of this?
You could use a dict of dicts instead of a dict of namedtuples. Dicts are mutable, so you'll be able to modify the inner dicts.
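A minimal sketch of that, with hypothetical sample values:

import pandas as pd

df1 = pd.DataFrame({"x": [1, 2]})   # placeholder data frames
df2 = pd.DataFrame({"y": [3, 4]})

experiment_data = {
    1: {
        "sample_info_1": "wild type",   # hypothetical sample metadata
        "sample_info_2": "batch A",
        "dataframe_1": df1,
        "dataframe_2": df2,
    },
    # 2: {...}, and so on for each ID
}

# inner dicts are mutable, so downstream analysis can update entries in place
experiment_data[1]["dataframe_1"] = experiment_data[1]["dataframe_1"].head(1)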
Given what you said in the comments about the structures of each DataFrame-1 and -2 being comparable, you could also group all of each into one big DataFrame, by adding a column to each DataFrame containing the value of sample_info_1 repeated across all rows, and likewise for sample_info_2. Then you could concat all the DataFrame-1s into a big one, and likewise for the DataFrame-2s, getting all your data into two DataFrames. (Depending on the structure of those DataFrames, you could even join them into one.)
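Continuing the dict-of-dicts sketch above, the grouping could look like this:

import pandas as pd

frames = []
for sample_id, entry in experiment_data.items():
    d = entry["dataframe_1"].copy()
    d["id"] = sample_id
    d["sample_info_1"] = entry["sample_info_1"]   # repeated across all rows
    d["sample_info_2"] = entry["sample_info_2"]
    frames.append(d)

# all the DataFrame-1s end up in one big DataFrame, one block per ID
big_df1 = pd.concat(frames, ignore_index=True)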
