Equivalent of R's rbindlist function in Python

First I create empty lists based on the length of num_vars and store the output of each loop iteration in one list.
After that I want to combine all the outputs and convert the result into a pandas DataFrame.
In R this can be done simply with rbindlist, which combines list objects.
I tried the following Python code:
ests_list = [[] for i in range(num_vars)]
for i in range(num_vars):
    for j in range(1, num_vars + 1):
        ests_list[i] = pd.merge(df1,
                                df2,
                                how='left',
                                on=eval('combine%s' % j + '_lvl'))
pd.concat(ests_list)
When I run the above code, it throws the following error:
TypeError: cannot concatenate object of type "<class 'list'>"; only
pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
Can anyone help me solve this issue?
Thanks in advance.

I found a solution to my problem:
ests_list = []
for i in list(range(1, num_vars)):
    ests_list.append(df1.merge(df2, how='left', on=eval("combine%s" % i + "_lvl")))
pd.concat(ests_list)
I create an empty list and append each loop's output to it.
Then I combine all the DataFrames in the list with the pd.concat function, which gives me the output as a pandas DataFrame.
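For reference, here is a minimal self-contained sketch of this rbindlist-style pattern, with hypothetical df1/df2 and a single hard-coded join key standing in for the combine%s_lvl variables (keeping the key names in a plain list also avoids the eval call):
import pandas as pd

# Hypothetical data standing in for the question's df1/df2.
df1 = pd.DataFrame({'combine1_lvl': ['a', 'b'], 'x': [1, 2]})
df2 = pd.DataFrame({'combine1_lvl': ['a', 'b'], 'y': [10, 20]})

# Collect each iteration's DataFrame in a plain list...
keys = ['combine1_lvl']  # join-key names in a list instead of eval'd variable names
frames = [df1.merge(df2, how='left', on=key) for key in keys]

# ...then stack them row-wise in a single call, like R's rbindlist.
result = pd.concat(frames, ignore_index=True)
print(result)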

Related

Joining two one-column dataframes generated with Spark

I'm working with PySpark and pandas in Databricks. I'm generating the following two dataframes:
start_date = datetime.today() - timedelta(days=60)
end_date = datetime.today()
date_list = pd.date_range(start=start_date,end=end_date).strftime('%Y-%m-%d').tolist()
date_df = spark.createDataFrame(date_list, 'string').toDF("date")
and
random_list = np.random.normal(loc=50, scale=10, size=61)
random_list = [round(i) for i in random_list]
integer_df = spark.createDataFrame(random_list, 'integer').toDF("value")
so I have two dataframes of one column each ("date" and "value") of the same length and I'd like to "merge" them into one dataframe.
I've tried this:
integer_df = pd.concat(date_df)
which returns the following error:
first argument must be an iterable of pandas-on-Spark objects, you passed an object of type "DataFrame"
and this:
test_df = pd.concat([integer_df, date_df], axis=1, join='inner')
which returns the following error:
cannot concatenate object of type 'list; only ps.Series and ps.DataFrame are valid
Mostly I'd like to understand these errors.
From what I can see, you are not converting the objects correctly; for example, you are trying to concatenate a Spark DataFrame with a pandas DataFrame.
first argument must be an iterable of pandas-on-Spark objects, you passed an object of type "DataFrame"
This error was raised because you passed the wrong type of object to concatenate. You should use pandas-on-Spark objects, or plain pandas objects if you are going to use pandas.
So to fix your first error, follow the convention: work with the objects of one library at a time. Something like this (or just use pd.Series() or pd.DataFrame()):
date_df = spark.createDataFrame(date_list, 'string').toPandas()
# the .toDF("date") call becomes redundant once you convert to pandas; name the column on the pandas side
integer_df = spark.createDataFrame(random_list, 'integer').toPandas()
After that, try pd.concat([...]) with the resulting objects.
Your second error was raised because pd.concat here only accepts Series and DataFrame objects; since you passed a PySpark DataFrame, pandas read it as a plain list.
So again, use the correct object type for the library, or convert to numpy if you want something more efficient.
Hope this helps.
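Putting that advice together, a rough end-to-end sketch, assuming a live SparkSession named spark as in Databricks (and sizing the random list to the date list so the lengths stay in step):
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

start_date = datetime.today() - timedelta(days=60)
date_list = pd.date_range(start=start_date, end=datetime.today()).strftime('%Y-%m-%d').tolist()
random_list = [int(round(i)) for i in np.random.normal(loc=50, scale=10, size=len(date_list))]

# Convert both one-column Spark DataFrames to plain pandas before combining.
date_pdf = spark.createDataFrame(date_list, 'string').toPandas()
integer_pdf = spark.createDataFrame(random_list, 'integer').toPandas()
date_pdf.columns = ['date']
integer_pdf.columns = ['value']

# Both inputs are now pandas objects, so a column-wise concat works.
test_df = pd.concat([date_pdf, integer_pdf], axis=1)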

Python: convert a list of large JSON objects to a DataFrame

I want to convert a list of objects to a pandas dataframe. The objects are large and complex; a sample one can be found here
The output should be a DF with 3 columns: info, releases, URLs - as per the json object linked above. I've tried pd.DataFrame and from_records, but I keep getting hit with errors. Can anyone suggest a fix?
Have you tried the pd.read_json() function?
Here's a link to the documentation:
https://pandas.pydata.org/docs/reference/api/pandas.io.json.read_json.html
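Alternatively, since the input is already a Python list of parsed objects rather than a JSON file, the plain DataFrame constructor may be enough. A minimal sketch with hypothetical records shaped like the linked sample:
import pandas as pd

# Hypothetical objects with the three top-level keys from the question.
records = [
    {'info': {'id': 1}, 'releases': [{'tag': 'v1'}], 'URLs': ['https://example.com']},
    {'info': {'id': 2}, 'releases': [], 'URLs': []},
]

# Each top-level key becomes one column; nested values stay as Python objects.
df = pd.DataFrame(records, columns=['info', 'releases', 'URLs'])
print(df)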

How to Convert Dask DataFrame Into List of Dictionaries?

I need to convert a dask dataframe into a list of dictionaries as the response for an API endpoint. I know I can convert the dask dataframe to pandas, and then from there I can convert to dictionary, but it would be better to map each partition to a dict, and then concatenate.
What I tried:
df = dd.read_csv(path, usecols=cols)
dd.compute(df.to_dict(orient='records'))
Error I'm getting:
AttributeError: 'DataFrame' object has no attribute 'to_dict'
You can do it as follows:
import dask.bag as db
db.from_delayed(df.map_partitions(pd.DataFrame.to_dict, orient='records').to_delayed())
which gives you a bag which you could compute (if it fits in memory) or otherwise manipulate.
Note that to_delayed/from_delayed should not be necessary, there is also a to_bag method, but it doesn't seem to do the right thing.
Also, you are not really getting much from the dataframe model here, you may want to start with db.read_text and the builtin CSV module.
Try this:
data=list(df.map_partitions(lambda x:x.to_dict(orient="records")))
It returns a list of dictionaries in which each row is converted to a dictionary.
Kunal Bafna's answer is the easiest to implement and has fewer dependencies.
data=list(df.map_partitions(lambda x:x.to_dict(orient="records")))
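For completeness, a minimal sketch of the per-partition route via to_delayed, assuming only a placeholder CSV path:
import dask
import dask.dataframe as dd

df = dd.read_csv('data.csv')  # hypothetical path

# to_delayed() yields one delayed pandas DataFrame per partition;
# dask.compute materialises them all in a single pass.
partitions = dask.compute(*df.to_delayed())

# Flatten each partition's row dicts into one list for the API response.
records = [row for pdf in partitions for row in pdf.to_dict(orient='records')]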

TypeError: cannot concatenate a non-NDFrame object while trying to concat

This is my code, and I am trying to concatenate on axis=0 (append), but I get this error. All the columns are the same, and even the datatypes are the same.
Although when doing pd.read_csv() I have to specify encoding='ISO-8859-1'.
Do you think that could be the reason for it?
The objects are DataFrames, though.
declinedataforclassification=declinedata[['amount_requested','Risk_Score','dti','State','emp_length','policy_code','app_month']]
loandataforclassification=loandata[['loan_amnt','Risk_Score','dti','addr_state','emp_length','policy_code','issue_month']]
loandataforclassification=loandataforclassification.rename(columns={'loan_amnt':'amount_requested','addr_state':'State','issue_month':'app_month'})
declinedataforclassification['status']=0
loandataforclassification['status']=1
loandataforclassification['amount_requested']=loandataforclassification['amount_requested'].astype(float)
resultdata = loandataforclassification.append('declinedataforclassification',ignore_index=True)
resultdata = loandataforclassification.append(declinedataforclassification,ignore_index=True)
should work for you. You're trying to append a string to a data frame right now.
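Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same row-wise stacking is written with pd.concat (using the two frames built above):
import pandas as pd

resultdata = pd.concat(
    [loandataforclassification, declinedataforclassification],
    ignore_index=True,  # renumber the combined index, like ignore_index in append
)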

Adding an np.array as a column in a pandas.DataFrame

I have a pandas DataFrame and a one-dimensional numpy ndarray; effectively, it is a list.
How do I add a new column to the DataFrame with the values from the array?
test['preds'] = preds gives a SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
And when I try pd.DataFrame({test,preds}) I get TypeError: unhashable type: 'list'
Thanks to EdChum, the fix was this:
test= DataFrame(test)
test['preds']=preds
It works!
This is not a pandas error; it occurs because you are trying to instantiate a set with two lists.
{test,preds}
#TypeError: unhashable type: 'list'
A set is a container which needs all its content to be hashable, since sets may not contain the same element twice.
That being said, handing pandas a set will not work for your desired result.
Handing pandas a dict however, will work, like this:
pd.DataFrame({"test":test,"preds":preds})
