I have the line below in my data pipeline code, which takes a JSON array and normalizes it using pandas.json_normalize:
df = pd.json_normalize(reviews, sep='_')
Now, when reviews comes in as null/None, it has suddenly started failing. What should be done here?
I logged all the data that reviews receives in a for loop, and from that I understood that this failure occurs only when reviews is null.
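For reference, a minimal sketch of one possible guard, assuming reviews may arrive as None (the sample payload is hypothetical):
import pandas as pd

reviews = None  # hypothetical: upstream sends null for some records
# fall back to an empty list so json_normalize returns an empty frame instead of failing
df = pd.json_normalize(reviews or [], sep='_')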
Which version of pandas are you using?
If you get the error AttributeError: module 'pandas' has no attribute 'json_normalize' after running from pandas import json_normalize, it may be due to the version you are using.
Either downgrade pandas to a version before 1.0.3, where json_normalize lives in the pandas.io.json module, or, on newer versions, import it directly from the package with from pandas import json_normalize.
After that you can try something like:
from pandas.io.json import json_normalize  # pre-1.0 location of json_normalize

data = {"xy": ["1", "2", "3"]}
df = json_normalize(data)
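On pandas 1.0 and later, the same call works through the top-level function (a minimal sketch):
import pandas as pd

data = {"xy": ["1", "2", "3"]}
df = pd.json_normalize(data)  # top-level since pandas 1.0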
I use the modin library for multiprocessing.
While the library is great for faster processing, it fails at merge, and I would like to revert to default pandas partway through the code.
I understand that, per PEP 8's E402 convention, imports should be declared once, at the top of the file; however, my case would need otherwise.
import pandas as pd
import modin.pandas as mpd
import os
import ray
ray.init()
os.environ["MODIN_ENGINE"] = "ray"
df = mpd.read_csv()
# do stuff
Then I would like to revert to default pandas within the same code, but how would I do the lines below in pandas? There does not seem to be a clean way to switch between pd and mpd, and unfortunately Modin seems to take precedence over pandas.
df = df.loc[:, df.columns.intersection(['col1', 'col2'])]
df = df.drop_duplicates()
df = df.sort_values(['col1', 'col2'], ascending=[True, True])
Is it possible? If yes, how?
You can simply do the following:
import modin.pandas as mpd
import pandas as pd
This way you have both Modin and the original pandas in memory, and you can switch between them as needed.
Many have posted answers already, but in this particular case, as pointed out by @Nin17 and this comment from the Modin GitHub, to convert from Modin to pandas for single-core processing of operations like df.merge you can use:
import pandas as pd
import modin.pandas as mpd
import os
import ray
ray.init()
os.environ["MODIN_ENGINE"] = "ray"
df_modin = mpd.read_csv()  # read the dataframe into Modin for parallel processing
df_pandas = df_modin._to_pandas()  # convert the Modin dataframe to pandas for single-core processing
And if you would like to convert the dataframe back to a Modin dataframe for parallel processing:
df_modin = mpd.DataFrame(df_pandas)
You can try the pandarallel package instead of Modin; it is based on a similar concept: https://pypi.org/project/pandarallel/#description
Pandarallel benchmarks: https://libraries.io/pypi/pandarallel
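For illustration, a minimal sketch of how pandarallel is typically initialized and used (the dataframe and function here are made up):
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize()  # spawn the worker processes; also accepts e.g. progress_bar=True

df = pd.DataFrame({'col1': range(1000)})
df['col2'] = df['col1'].parallel_apply(lambda x: x * 2)  # parallel drop-in for apply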
As @Nin17 said in a comment on the question, this comment from the Modin GitHub describes how to convert a Modin dataframe to pandas. Once you have a pandas dataframe, you can call any pandas method on it. This other comment from the same issue describes how to convert the pandas dataframe back to a Modin dataframe.
How can I get this JSON file into a Python dataframe? https://data.cdc.gov/resource/8xkx-amqh.json
I tried to read the data using Socrata and it was working. However, it has a limit and I need the whole data.
That's what I have:
from sodapy import Socrata
import pandas as pd

client = Socrata("data.cdc.gov", app_token=None)
# First 5000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
vcounty = client.get_all("8xkx-amqh", limit=5000)
# Convert to pandas DataFrame
vcounty_df = pd.DataFrame.from_records(vcounty)
But I want the whole data, and from what I understand, Socrata's limit is less than what I need.
The API is limited for unauthorized users, but you can download all the data in CSV format and convert it to a dataframe. There are 1.5+ million rows.
# pip install requests
# pip install pandas
import requests
import pandas as pd
import io
urlData = requests.get('https://data.cdc.gov/api/views/8xkx-amqh/rows.csv?accessType=DOWNLOAD').content
df = pd.read_csv(io.StringIO(urlData.decode('utf-8')))
df
Returns the full dataset as a dataframe (1.5+ million rows).
I read in a JSON file as a pandas dataframe. Now I want to print the JSON object schema. I looked around and mostly only saw links to do this online, but my file is too big (almost 11k objects/lines). I'm new at this, so I was wondering: is there a function/code with which I can do this in Python?
What I have so far...
import json
import pandas as pd
df = pd.read_json('/Users/name/Downloads/file.json', lines=True)
print(df)
I can't add a comment, but maybe try converting the df into JSON in a variable and then printing the variable.
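A minimal sketch of what that suggestion might look like (orient='records' is just one choice among several):
json_str = df.to_json(orient='records')  # serialize the frame back into a JSON string
print(json_str)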
You can, if you use the pydantic and datamodel-code-generator libraries.
Use datamodel-code-generator to produce the Python model and then use pydantic to print out the schema from the model.
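A rough sketch of that workflow; the file names are placeholders, and the generated root class being named Model is a default worth verifying in the generated file:
pip install pydantic datamodel-code-generator
datamodel-codegen --input file.json --input-file-type json --output model.py

from model import Model  # module written by datamodel-codegen above
print(Model.schema_json(indent=2))  # pydantic v1 API for dumping the model's JSON schema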
I read my ARFF data from here https://archive.ics.uci.edu/ml/machine-learning-databases/00426/ like this:
from scipy.io import arff
import pandas as pd
data = arff.loadarff('Autism-Adult-Data.arff')
df = pd.DataFrame(data[0])
df.head()
But my dataframe has b'' prefixes in all values in all columns.
How do I remove them?
When I try this, it doesn't work either:
from scipy.io import arff
import pandas as pd
data = arff.loadarff('Autism-Adult-Data.arff')
df = pd.DataFrame(data[0].str.decode('utf-8'))
df.head()
It says AttributeError: 'numpy.ndarray' object has no attribute 'str'
As you can see, the .str.decode('utf-8') approach from "Removing b'' from string column in a pandas dataframe" didn't solve the problem.
This doesn't work either:
df.index = df.index.str.encode('utf-8')
As you can see, both the strings and the numbers are bytes objects.
I was looking at the same dataset and had a similar issue. I did find a workaround, and hopefully this post will be helpful. Rather than use from scipy.io import arff, I used another library called liac-arff. The code should look like:
pip install liac-arff
or whatever pip command works for your operating system or IDE, and then:
import arff
import pandas as pd
data = arff.load(open('Autism-Adult-Data.arff'))  # load() reads a file object; loads() expects a string of ARFF content
load returns a dictionary. To find what keys that dictionary has, you do:
data.keys()
and you will find that all arff files have the following keys
['description', 'relation', 'attributes', 'data']
Where data is the actual data and attributes holds the column names and the unique values of those columns. So to get a data frame, you need to do the following:
colnames = []
for i in range(len(data['attributes'])):
    colnames.append(data['attributes'][i][0])  # first element of each attribute tuple is the column name
df = pd.DataFrame.from_dict(data['data'])
df.columns = colnames
df.head()
I went a bit overboard here with creating the dataframe and all, but this returns a data frame with no b'' issues, and the key is using import arff.
So the GitHub for the library I used can be found here.
Although Shimon shared an answer, you could also give this a try; it decodes only the object (bytes) columns and passes numeric ones through untouched:
df.apply(lambda x: x.str.decode('utf8') if x.dtype == object else x)
When I try to use the from_csv method in Python 3.7, I receive an AttributeError:
import pandas as pd
pd.DataFrame.from_csv(adr)
AttributeError: type object 'DataFrame' has no attribute 'from_csv'
How can I solve this problem?
from_csv is deprecated now, and there will be no further development on it.
It's suggested to use pd.read_csv instead.
import pandas as pd
df = pd.read_csv("your-file-path-here")
And the Python warning now says the same:
__main__:1: FutureWarning: from_csv is deprecated. Please use read_csv(...) instead. Note that some of the default arguments are different, so please refer to the documentation for from_csv when changing your function calls
To read a CSV file into a pandas dataframe, you need to use the function read_csv. You may try the following code:
import pandas as pd
pd.read_csv('adr.csv')
The following link will give you an idea of how to use pandas to read and write a CSV file:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html