ValueError: arrays must all be same length - print dataframe to CSV - python

Thanks for stopping by! I was hoping to get some help creating a CSV from a pandas DataFrame. Here is my code:
a = ldamallet[bow_corpus_new[:21]]
b = data_text_new
print(a)
print("/n")
print(b)
d = {'Preprocessed Document': b['Preprocessed Document'].tolist(),
'topic_0': a[0][1],
'topic_1': a[1][1],
'topic_2': a[2][1],
'topic_3': a[3][1],
'topic_4': a[4][1],
'topic_5': a[5][1],
'topic_6': a[6][1],
'topic_7': a[7][1],
'topic_8': a[8][1],
'topic_9': a[9][1],
'topic_10': a[10][1],
'topic_11': a[11][1],
'topic_12': a[12][1],
'topic_13': a[13][1],
'topic_14': a[14][1],
'topic_15': a[15][1],
'topic_16': a[16][1],
'topic_17': a[17][1],
'topic_18': a[18][1],
'topic_19': a[19][1]}
print(d)
df = pd.DataFrame(data=d)
df.to_csv("test.csv", index=False)
The data:
print(a) returns a list of lists of (topic_number, topic_percentage) tuples, one inner list per document:
[[(0, #), (1, #), ..., (19, #)], [(0, #), ..., (19, .819438), ...], ...]
print(b) shows the preprocessed documents, one per row.
Here is my error:
ValueError: arrays must all be same length
(Screenshots of the DataFrame's size and of the desired layout did not survive; the goal is one row per document, with the preprocessed text plus one column per topic percentage.)
Any help would be greatly appreciated :)

It might be easiest to get the second value of each tuple for all of the rows into its own list. Something like this:
topic_0 = []
topic_1 = []
topic_2 = []
# ...and so on
for i in a:
    topic_0.append(i[0][1])
    topic_1.append(i[1][1])
    topic_2.append(i[2][1])
    # ...and so on
Then you can make your dictionary like so:
d = {'Preprocessed Document': b['Preprocessed Document'].tolist(),
     'topic_0': topic_0,
     'topic_1': topic_1,
     # etc.
     }

I took @mattcremeens' advice and it worked; I've posted the full code below. He was right about nixing the tuples: my previous code wasn't iterating through the rows and only printed the first row.
topic_0=[]
topic_1=[]
topic_2=[]
topic_3=[]
topic_4=[]
topic_5=[]
topic_6=[]
topic_7=[]
topic_8=[]
topic_9=[]
topic_10=[]
topic_11=[]
topic_12=[]
topic_13=[]
topic_14=[]
topic_15=[]
topic_16=[]
topic_17=[]
topic_18=[]
topic_19=[]
for i in a:
    topic_0.append(i[0][1])
    topic_1.append(i[1][1])
    topic_2.append(i[2][1])
    topic_3.append(i[3][1])
    topic_4.append(i[4][1])
    topic_5.append(i[5][1])
    topic_6.append(i[6][1])
    topic_7.append(i[7][1])
    topic_8.append(i[8][1])
    topic_9.append(i[9][1])
    topic_10.append(i[10][1])
    topic_11.append(i[11][1])
    topic_12.append(i[12][1])
    topic_13.append(i[13][1])
    topic_14.append(i[14][1])
    topic_15.append(i[15][1])
    topic_16.append(i[16][1])
    topic_17.append(i[17][1])
    topic_18.append(i[18][1])
    topic_19.append(i[19][1])
d = {'Preprocessed Document': b['Preprocessed Document'].tolist(),
'topic_0': topic_0,
'topic_1': topic_1,
'topic_2': topic_2,
'topic_3': topic_3,
'topic_4': topic_4,
'topic_5': topic_5,
'topic_6': topic_6,
'topic_7': topic_7,
'topic_8': topic_8,
'topic_9': topic_9,
'topic_10': topic_10,
'topic_11': topic_11,
'topic_12': topic_12,
'topic_13': topic_13,
'topic_14': topic_14,
'topic_15': topic_15,
'topic_16': topic_16,
'topic_17': topic_17,
'topic_18': topic_18,
'topic_19': topic_19}
df = pd.DataFrame(data=d)
df.to_csv("test.csv", index=False, mode = 'a')

Related

Dividing each column in a pandas df by a value from another df

I have a dataframe of size (44, 44) and another one of size (44,).
I need to divide each item in a column 'EOFx' by a number in a column 'PCx'
(e.g. all values in 'EOF1' by 'PC1').
I've been trying string and numeric loops but nothing seems to work at all (an error) or I get NaNs.
The last thing I tried was:
for k in eof_df.keys():
    for m in pc_df.keys():
        eof_df[k].divide(pc_df[m])
The end result should be a modified eof_df.
What did work for one column, outside the loop, is this:
eof_df.iloc[:,0].divide(std_df.iloc[0]).head()
Thank you!
Update 1, in response to MoRe:
for eof_df it will be:
{'EOF1': {'8410140.nc': -0.09481700372712784,
'8418150.nc': -0.11842440098461708,
'8443970.nc': -0.1275311990493338,
'8447930.nc': -0.1321116945944401,
'8449130.nc': -0.11649753033608201,
'8452660.nc': -0.14776686151828214,
'8454000.nc': -0.1451132595405897,
'8461490.nc': -0.17032364516557338,
'8467150.nc': -0.20725618455428937,
'8518750.nc': -0.2249648853806308},
'EOF2': {'8410140.nc': 0.051213689088367806,
'8418150.nc': 0.0858110390036938,
'8443970.nc': 0.09029173023479754,
'8447930.nc': 0.05526955432871537,
'8449130.nc': 0.05136680082838883,
'8452660.nc': 0.06105351220962777,
'8454000.nc': 0.052112043784544135,
'8461490.nc': 0.08652511173850089,
'8467150.nc': 0.1137754089944319,
'8518750.nc': 0.10461193696203},
and it goes to EOF44.
For pc_df it will be
{'PC1': 0.5734671652560537,
'PC2': 0.29256502033278076,
'PC3': 0.23586098119374838,
'PC4': 0.227069130368915,
'PC5': 0.1642170373016029,
'PC6': 0.14131097046499339,
'PC7': 0.09837935104899741,
'PC8': 0.0869056762311067,
'PC9': 0.08183389338415169,
'PC10': 0.07467191608481094}
Assuming the order of pc_df matches the order of eof_df's columns, you can divide the underlying numpy arrays and rebuild the frame:
output = pd.DataFrame(index=eof_df.index, data=eof_df.values / pc_df.values)
output.columns = eof_df.columns
Or, if the values should instead be divided row-wise, transpose before and after, and rename the result columns:
data = pd.DataFrame(eof_df.values.T / pc_df.values.T).T
data.columns = ["divided" + str(i + 1) for i in data.columns.to_list()]

Normalize json column and join with rest of dataframe

This is my first question here on Stack Overflow, so please don't roast me.
I was trying to find similar problems on the internet and actually there are several, but for me the solutions didn't work.
I have created this dataframe:
import pandas as pd
from ast import literal_eval
d = {'order_id': [1], 'email': ["hi@test.com"], 'line_items': ["[{'sku':'testproduct1', 'quantity':'2'},{'sku':'testproduct2','quantity':'2'}]"]}
orders = pd.DataFrame(data=d)
It looks like this:
order_id    email          line_items
1           hi@test.com    [{'sku':'testproduct1', 'quantity':'2'},{'sku':'testproduct2','quantity':'2'}]
I want the dataframe to look like this:
order_id    email          line_items.sku    line_items.quantity
1           hi@test.com    testproduct1      2
1           hi@test.com    testproduct2      2
I used the following code to change the type of line_items from string to dict:
orders.line_items = orders.line_items.apply(literal_eval)
Normally I would use json_normalize now to flatten the line_items column. But I also want to keep the id and don't know how to do that. I also want to avoid any loops.
Is there anyone who can help me with this issue?
Kind regards
joant95
If your dictionary really is that strange, then you could try:
d['line_items'] = eval(d['line_items'][0])
df = pd.json_normalize(d, record_path=['line_items'], meta=['order_id', 'email'])
To create d out of orders you could try:
d = orders.to_dict(orient='list')
Or you could try:
orders.line_items = orders.line_items.map(eval)
d = orders.to_dict(orient='records')
df = pd.json_normalize(d, record_path=['line_items'], meta=['order_id', 'email'])
But: I still don't have a clear picture of the situation :)
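An alternative that stays in pandas (a sketch, assuming line_items always holds a list of flat dicts after parsing) is to explode the column and normalize the resulting dicts:
from ast import literal_eval

orders['line_items'] = orders['line_items'].map(literal_eval)
# one row per line item, with the order_id and email repeated
exploded = orders.explode('line_items').reset_index(drop=True)
# turn the dicts into their own columns, prefixed to match the desired names
items = pd.json_normalize(exploded['line_items'].tolist()).add_prefix('line_items.')
df = pd.concat([exploded.drop(columns='line_items'), items], axis=1)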

How to create Dynamic Dataframe in Pandas

lst = ['SymbolA','SymbolB', 'SymbolC' .... 'SymbolN']
I want to create DataFrames dynamically in Python pandas. This is my attempt:
for i in lst:
    data = SomeFunction(lst[i]) # This will return dataframe of 10 x 100
    lst[i]+str(i) = pd.DataFrame(data)
pd.Concat(SymbolA1,SymbolB1,SymbolC1,SymbolD1)
Can anyone help with how to create the dataframes dynamically as per the requirements?
I hope this will help, as far as I understood the question.
gbl = globals()
lst = ['SymbolA','SymbolB', 'SymbolC' .... 'SymbolN']
for i, name in enumerate(lst):
    data = SomeFunction(name)
    gbl[name + str(i)] = pd.DataFrame(data)
This will create the DataFrames dynamically. To access one of them you need to run code like this:
gbl[name + str(i)]
Try this. Your input has to be like below:
lst = {'data': ['SymbolA', 'SymbolB', 'SymbolC', 'SymbolN']}
print(pd.DataFrame(lst))
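A common alternative to writing into globals() is to keep the frames in a dictionary, which is easier to iterate over and concatenate later. A sketch, where SomeFunction and lst are the placeholders from the question:
frames = {}
for name in lst:
    frames[name] = pd.DataFrame(SomeFunction(name))

# access an individual frame by name...
df_a = frames['SymbolA']
# ...or concatenate them all, keyed by symbol
combined = pd.concat(frames)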

pandas - drop row with list of values, if contains from list

I have a huge set of data, something like 100k lines, and I am trying to drop a row from a dataframe if that row's list contains a value from another dataframe. Here's a small example.
has = [['#a'], ['#b'], ['#c, #d, #e, #f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
tweet user
0 [#a] 1
1 [#b] 2
2 [#c, #d, #e, #f] 3
3 [#g] 5
z
0 #d
1 #a
The desired outcome would be
tweet user
0 [#b] 2
1 [#g] 5
Things I've tried:
# this seems to work for dropping #a but not #d
for a in range(df.tweet.size):
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a)
#this works for my small scale example but throws an error on my big data
df['tweet'] = df.tweet.apply(', '.join)
test = df[~df.tweet.str.contains('|'.join(df2['z'].astype(str)))]
#the error being "unterminated character set at position 1343770"
#i went to check what was on that line and it returned this
basket.iloc[1343770]
user_id 17060480
tweet [#IfTheyWereBlackOrBrownPeople, #WTF]
Name: 4612505, dtype: object
Any help would be greatly appreciated.
Is ['#c, #d, #e, #f'] one string, or a list like this: ['#c', '#d', '#e', '#f']?
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
A simple solution would be:
screen = set(df2.z.tolist())
to_delete = list()  # this will speed things up doing only 1 delete
for id, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(id)
df.drop(to_delete, inplace=True)
Speed comparison (for 10,000 rows):
import time

st = time.time()
screen = set(df2.z.tolist())
to_delete = list()
for id, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(id)
df.drop(to_delete, inplace=True)
print(time.time()-st)
2.142000198364258
st = time.time()
for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break
print(time.time()-st)
43.99799990653992
For me, your code works if I make several adjustments.
First, you're missing the last row when using range(df.tweet.size); either increase the range or (more robust, if you don't have an increasing index) iterate over df.tweet.index.
Second, you never apply your drop; use inplace=True for that.
Third, you have #d inside a string: '#c, #d, #e, #f' is a single string, not a list, and you have to change it to a list of separate hashtags for the membership test to work.
So if you change that, the following code works fine:
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break # so if we already dropped it we no longer look whether we should drop this line
This will produce the desired result. Be aware that this is potentially suboptimal, since it is not vectorized.
EDIT:
You can turn the comma-separated strings into proper lists with the following:
from itertools import chain
df.tweet = df.tweet.apply(lambda l: list(chain(*map(lambda lelem: [s.strip() for s in lelem.split(",")], l))))
This applies a function to each row (assuming each row contains a list with one or more elements): split each element (a string) on commas, strip the surrounding whitespace so '#d' matches exactly, and "flatten" the resulting lists (if there are multiple) together.
EDIT2:
Yes, this is not really performant, but it basically does what was asked. Keep that in mind, and once it works, try to improve your code (fewer for-iterations; use tricks like collecting the indices and then dropping them all at once).
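For completeness, a boolean-mask version of the set-intersection idea (a sketch, assuming the tweet column holds proper lists of hashtags) avoids the explicit drop loop entirely:
screen = set(df2['z'])
# keep only the rows whose hashtag list shares nothing with screen
result = df[df['tweet'].apply(lambda tags: not screen.intersection(tags))]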

DataFrame constructor not properly called! error

I am new to Python and I am facing a problem creating a DataFrame in key/value format, i.e.:
data = [{'key': '[GlobalProgramSizeInThousands]', 'value': '1000'},]
Here is my code:
columnsss = ['key', 'value']
query = "select * from bparst_tags where tag_type = 1"
result = database.cursor(db.cursors.DictCursor)
result.execute(query)
result_set = result.fetchall()
data = "["
for row in result_set:
    data += "{'value': %s , 'key': %s }," % (row["tag_expression"], row["tag_name"])
data += "]"
df = DataFrame(data, columns=columnsss)
But when I pass the data to DataFrame it shows me:
pandas.core.common.PandasError: DataFrame constructor not properly called!
However, if I print the data and assign that printed value to the data variable by hand, then it works.
You are providing a string representation of a dict to the DataFrame constructor, and not a dict itself. So this is the reason you get that error.
So if you want to use your code, you could do:
df = DataFrame(eval(data))
But it would be better not to create the string in the first place and instead build a list of dicts directly. Something roughly like:
data = []
for row in result_set:
    data.append({'value': row["tag_expression"], 'key': row["tag_name"]})
But probably even this is not needed, as depending on what exactly is in your result_set you could probably:
provide it directly to a DataFrame: DataFrame(result_set)
or use the pandas read_sql_query function to do this for you (see the docs, and the sketch below)
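For the second option, a minimal sketch, assuming conn is an open DB-API connection to the same database:
import pandas as pd

query = "select tag_name, tag_expression from bparst_tags where tag_type = 1"
df = pd.read_sql_query(query, conn)

# rename to the desired key/value columns
df = df.rename(columns={'tag_name': 'key', 'tag_expression': 'value'})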
Just ran into the same error, but the above answer could not help me.
My code worked fine on my computer which was like this:
test_dict = {'x': '123', 'y': '456', 'z': '456'}
df=pd.DataFrame(test_dict.items(),columns=['col1','col2'])
However, it did not work on another platform, where it gave me the same error as mentioned in the original question. I tried the code below, simply adding list() around the dictionary items, and it worked smoothly afterwards:
df=pd.DataFrame(list(test_dict.items()),columns=['col1','col2'])
Hopefully, this answer can help whoever ran into a similar situation like me.
import json
import pandas as pd

# Opening the JSON file
f = open('data.json')

# json.load returns the JSON contents as a dictionary
data1 = json.load(f)

# converting it into a dataframe
df = pd.DataFrame.from_dict(data1, orient='index')
