I'm trying to perform multiple sums on the same dataframe and then concatenate the resulting dataframes into one final dataframe. Is there a concise way of doing this, or do I need to use iteration?
I have a dict of the form {key: [list_of_idx], ...} and need to group my dataframe once per key.
Sample data
import pandas as pd
import random

random.seed(1)
df_len = 5
df = {'idx': {i: i+1 for i in range(df_len)},
      'data': {i: random.randint(1, 11) for i in range(df_len)}}
df = pd.DataFrame(df).set_index('idx')
# Groups with the idx to groupby
groups = {'a': [1, 2, 3, 4, 5],
          'b': [1, 4],
          'c': [5]}
# I'm trying to avoid/find a faster way than this
dfs = []
for grp in groups:
    _df = df.loc[groups[grp]]
    _df['grp'] = grp
    _df = _df.groupby('grp').sum()
    dfs.append(_df)
dff = pd.concat(dfs)
Input (df)
     data
idx
1       2
2      10
3       9
4       3
5       6
Expected output (dff)
     data
grp
a      30
c       6
b       5
Note: I'm stuck with Python 2.7 and pandas 0.16.1.
Timing results
I tested the proposed methods and measured the execution time. Below is the mean time per execution (over 1000 executions for each answer).
I couldn't test Quang Hoang's first answer because of my pandas version.
time          method
0.00696 sec   my method (question)
0.00328 sec   piRSquared (pd.concat)
0.00024 sec   piRSquared (collections and defaultdict)
0.00444 sec   Quang Hoang (2nd method: concat + reindex)
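For reference, a minimal sketch of how such timings could be reproduced with timeit (the wrapper function below is illustrative, not the exact harness I used):
import timeit

def question_method():
    dfs = []
    for grp in groups:
        _df = df.loc[groups[grp]].copy()  # explicit copy avoids SettingWithCopyWarning
        _df['grp'] = grp
        dfs.append(_df.groupby('grp').sum())
    return pd.concat(dfs)

n = 1000
print(timeit.timeit(question_method, number=n) / n)  # mean seconds per run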
This should be (quite) a bit faster:
s = pd.Series(groups).explode()
df.reindex(s).groupby(s.index)['data'].sum()
Output:
a 30
b 5
c 6
Name: data, dtype: int64
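For reference, the intermediate s = pd.Series(groups).explode() is a long Series whose index carries the group keys, one row per idx value:
a    1
a    2
a    3
a    4
a    5
b    1
b    4
c    5
dtype: object
so grouping by s.index after the reindex sums each group's rows.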
Update: a similar approach for earlier pandas versions, although it might not be as fast:
s = pd.concat([pd.DataFrame({'grp': a, 'idx': b}) for a, b in groups.items()],
              ignore_index=True).set_index('grp')
df.reindex(s.idx).groupby(s.index)['data'].sum()
Clever use of pd.concat
pd.concat({k: df.loc[v] for k, v in groups.items()}).sum(level=0)
data
a 22
b 8
c 2
NOTE: This magically works for all columns.
Suppose we have more_data
import pandas as pd
import random

random.seed(1)
df_len = 5
df = {
    'idx': {i: i+1 for i in range(df_len)},
    'data': {i: random.randint(1, 11) for i in range(df_len)},
    'more_data': {i: random.randint(1, 11) for i in range(df_len)},
}
df = pd.DataFrame(df).set_index('idx')
Then
pd.concat({k: df.loc[v] for k, v in groups.items()}).sum(level=0)
data more_data
a 22 42
b 8 19
c 2 7
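A note for anyone on current pandas: the level argument of sum was deprecated and then removed in pandas 2.0, so on newer versions the equivalent is a groupby on the first index level:
pd.concat({k: df.loc[v] for k, v in groups.items()}).groupby(level=0).sum()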
But I'd stick with more Python: collections.defaultdict
from collections import defaultdict

results = defaultdict(int)
for k, V in groups.items():
    for v in V:
        results[k] += df.at[v, 'data']

pd.Series(results)
a 22
b 8
c 2
dtype: int64
For this to work with multiple columns, I have to set up the defaultdict a tad differently:
from collections import defaultdict

results = defaultdict(lambda: defaultdict(int))
for k, V in groups.items():
    for v in V:
        for c in df.columns:
            results[c][k] += df.at[v, c]

pd.DataFrame(results)
data more_data
a 22 42
b 8 19
c 2 7
This is what it would look like without defaultdict, using the dict method setdefault instead:
results = {}
for k, V in groups.items():
    for v in V:
        for c in df.columns:
            results.setdefault(c, {})
            results[c].setdefault(k, 0)
            results[c][k] += df.at[v, c]

pd.DataFrame(results)
data more_data
a 22 42
b 8 19
c 2 7
Related
Below, I have a dictionary called date_dict. I want to create a DataFrame in which each key of this dictionary appears in n rows, n being the key's value. For example, the date '20220107' would appear in 75910 rows. Would this be possible?
{'20220107': 75910,
'20220311': 145012,
'20220318': 214286,
'20220325': 283253,
'20220401': 351874,
'20220408': 419064,
'20220415': 486172,
'20220422': 553377,
'20220429': 620635,
'20220506': 684662,
'20220513': 748368,
'20220114': 823454,
'20220520': 886719,
'20220527': 949469,
'20220121': 1023598,
'20220128': 1096144,
'20220204': 1167590,
'20220211': 1238648,
'20220218': 1310080,
'20220225': 1380681,
'20220304': 1450031}
Maybe this could help.
import pandas as pd

myDict = {'20220107': 3, '20220311': 4, '20220318': 5}
wrkList = []
for k, v in myDict.items():
    for i in range(v):
        wrkList.append([k])

df = pd.DataFrame(wrkList)
print(df)
'''
Result
0
0 20220107
1 20220107
2 20220107
3 20220311
4 20220311
5 20220311
6 20220311
7 20220318
8 20220318
9 20220318
10 20220318
11 20220318
'''
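If numpy is available, a more concise sketch of the same idea uses np.repeat to build the whole column in one call (the column name 'date' is my choice, not from the question):
import numpy as np
import pandas as pd

myDict = {'20220107': 3, '20220311': 4, '20220318': 5}
# repeat each key according to its count
df = pd.DataFrame({'date': np.repeat(list(myDict.keys()), list(myDict.values()))})
print(df)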
I'm having a brain fart. I wrote some code to extract keywords from my data frame. It worked, but how can I put the printed information into my current data frame? Thanks in advance for the help.
from scipy.sparse import coo_matrix

def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    # use only topn items from vector
    sorted_items = sorted_items[:topn]
    score_vals = []
    feature_vals = []
    # word index and corresponding tf-idf score
    for idx, score in sorted_items:
        # keep track of feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])
    # create a dict of feature -> score
    # results = zip(feature_vals, score_vals)
    results = {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]] = score_vals[idx]
    return results

# sort the tf-idf vectors by descending order of scores
sorted_items = sort_coo(tf_idf_vector.tocoo())
# extract only the top n; n here is 5
keywords = extract_topn_from_vector(feature_names, sorted_items, 5)

# now print the results - NEED TO PUT THIS INFORMATION IN MY CURRENT DATAFRAME
print("\nAbstract:")
print(doc)
print("\nKeywords:")
for k in keywords:
    print(k, keywords[k])
First: a DataFrame is not Excel, so it may not look the way you expect.
You can use append() to add a new row with text. It will automatically fill with NaN if the row is shorter, or add NaN columns if the row is longer.
import pandas as pd

data = {
    'X': ['A', 'B', 'C'],
    'Y': ['D', 'E', 'F'],
    'Z': ['G', 'H', 'I']
}

df = pd.DataFrame(data)
print(df)

df = df.append({"X": 'Abstract:'}, ignore_index=True)
df = df.append({"X": 'Keywords:'}, ignore_index=True)

keywords = {"first": 123, "second": 456, "third": 789}
for key, value in keywords.items():
    df = df.append({"X": key, "Y": value}, ignore_index=True)

print(df)
Result:
# Before
X Y Z
0 A D G
1 B E H
2 C F I
# After
X Y Z
0 A D G
1 B E H
2 C F I
3 Abstract: NaN NaN
4 Keywords: NaN NaN
5 first 123 NaN
6 second 456 NaN
7 third 789 NaN
Later you can replace NaN with something else, e.g. an empty string:
df = df.fillna('')
Result:
X Y Z
0 A D G
1 B E H
2 C F I
3 Abstract:
4 Keywords:
5 first 123
6 second 456
7 third 789
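Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on newer versions, a sketch of the same idea with pd.concat would look like this:
import pandas as pd

df = pd.DataFrame({'X': ['A', 'B', 'C'], 'Y': ['D', 'E', 'F'], 'Z': ['G', 'H', 'I']})
keywords = {"first": 123, "second": 456, "third": 789}

# build all the new rows at once, then concatenate them below the original frame
extra = pd.DataFrame([{"X": 'Abstract:'}, {"X": 'Keywords:'}]
                     + [{"X": k, "Y": v} for k, v in keywords.items()])
df = pd.concat([df, extra], ignore_index=True).fillna('')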
I have scraped a webpage table, and the table items are in a sequential 1D list, with repeated headers. I want to reconstitute the table into a DataFrame.
I have an algorithm to do this, but I'd like to know if there is a more pythonic/efficient way to achieve this? NB. I don't necessarily know how many columns there are in my table. Here's an example:
input = ['A',1,'B',5,'C',9,
         'A',2,'B',6,'C',10,
         'A',3,'B',7,'C',11,
         'A',4,'B',8,'C',12]

output = {}
it = iter(input)
val = next(it)
while val:
    if val in output:
        output[val].append(next(it))
    else:
        output[val] = [next(it)]
    val = next(it, None)

df = pd.DataFrame(output)
print(df)
with the result:
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
If your data is always "well behaved", then something like this should suffice:
import pandas as pd

data = ['A',1,'B',5,'C',9,
        'A',2,'B',6,'C',10,
        'A',3,'B',7,'C',11,
        'A',4,'B',8,'C',12]

result = {}
for k, v in zip(data[::2], data[1::2]):
    result.setdefault(k, []).append(v)

df = pd.DataFrame(result)
You can also use numpy reshape:
import numpy as np
import pandas as pd

cols = sorted(set(data[::2]))
shape = (len(data) // (len(cols) * 2), len(cols) * 2)
df = pd.DataFrame(np.reshape(data, shape).T[1::2].T, columns=cols)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
Explanation:
# get columns
cols = sorted(set(data[::2]))
# reshape flat list into one row per table row
shape = (len(data) // (len(cols) * 2), len(cols) * 2)
np.reshape(data, shape)
# keep only the values of the data:
# .T[1::2].T transposes, slices every second row (the values), and transposes back
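Another sketch that stays in plain Python: pair headers with values, then chunk the pairs into row dicts (this also assumes the table is well behaved, i.e. every row has the same headers):
import pandas as pd

data = ['A',1,'B',5,'C',9,
        'A',2,'B',6,'C',10,
        'A',3,'B',7,'C',11,
        'A',4,'B',8,'C',12]

pairs = list(zip(data[::2], data[1::2]))   # [('A', 1), ('B', 5), ...]
ncols = len(set(data[::2]))                # number of distinct headers
rows = [dict(pairs[i:i + ncols]) for i in range(0, len(pairs), ncols)]
df = pd.DataFrame(rows)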
Simple dictionary:
d = {'a': set([1,2,3]), 'b': set([3, 4])}
(the sets may be turned into lists if it matters)
How do I convert it into a long/tidy DataFrame in which each column is a variable and every observation is a row, i.e.:
letter value
0 a 1
1 a 2
2 a 3
3 b 3
4 b 4
The following works, but it's a bit cumbersome:
id = 0
tidy_d = {}
for l, vs in d.items():
for v in vs:
tidy_d[id] = {'letter': l, 'value': v}
id += 1
pd.DataFrame.from_dict(tidy_d, orient = 'index')
Is there any pandas magic to do this? Something like:
pd.DataFrame([d]).T.reset_index(level=0).unnest()
where unnest obviously doesn't exist and comes from R.
You can use a comprehension with itertools.chain and zip (note that k*len(v) relies on the keys being single-character strings: 'a' * 3 == 'aaa', which chain.from_iterable then iterates character by character):
from itertools import chain

keys, values = map(chain.from_iterable, zip(*((k*len(v), v) for k, v in d.items())))
df = pd.DataFrame({'letter': list(keys), 'value': list(values)})
print(df)
letter value
0 a 1
1 a 2
2 a 3
3 b 3
4 b 4
This can be rewritten in a more readable fashion:
zipper = zip(*((k*len(v), v) for k, v in d.items()))
keys, values = map(list, map(chain.from_iterable, zipper))
df = pd.DataFrame({'letter': keys, 'value': values})
Use numpy.repeat with chain.from_iterable:
import numpy as np
from itertools import chain

df = pd.DataFrame({
    'letter': np.repeat(list(d.keys()), [len(v) for v in d.values()]),
    'value': list(chain.from_iterable(d.values())),
})
print(df)
letter value
0 a 1
1 a 2
2 a 3
3 b 3
4 b 4
A tad more "pandaic", inspired by this post:
pd.DataFrame.from_dict(d, orient = 'index') \
.rename_axis('letter').reset_index() \
.melt(id_vars = ['letter'], value_name = 'value') \
.drop('variable', axis = 1) \
.dropna()
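To see why the dropna is needed: from_dict(d, orient='index') first builds a wide frame padded with NaN for the shorter rows (shown here with the sets as lists, since the OP said that conversion is fine), which melt then unpivots:
pd.DataFrame.from_dict({'a': [1, 2, 3], 'b': [3, 4]}, orient='index')
#    0  1    2
# a  1  2  3.0
# b  3  4  NaN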
Some timings of melt and slightly modified chain answers:
import random
import timeit
from itertools import chain

import pandas as pd

print(pd.__version__)

dict_size = 1000000
randoms = [random.randint(0, 100) for __ in range(10000)]
max_list_size = 1000
d = {k: random.sample(randoms, random.randint(1, max_list_size))
     for k in range(dict_size)}
def chain_():
    keys, values = map(chain.from_iterable,
                       zip(*(([k] * len(v), v) for k, v in d.items())))
    pd.DataFrame({'letter': list(keys), 'value': list(values)})

def melt_():
    (pd.DataFrame.from_dict(d, orient='index')
       .rename_axis('letter').reset_index()
       .melt(id_vars=['letter'], value_name='value')
       .drop('variable', axis=1).dropna())

setup = """from __main__ import chain_, melt_"""
repeat = 3
numbers = 10

def timer(statement, _setup=''):
    print(min(
        timeit.Timer(statement, setup=_setup or setup).repeat(repeat, numbers)))
print('timing')
timer('chain_()')
timer('melt_()')
Seems melt is faster for max_list_size 100:
1.0.3
timing
246.71311019999996
204.33705529999997
and slower for max_list_size 1000:
2675.8446872
4565.838648400002
probably because melt allocates memory for a much bigger DataFrame than is actually needed.
A variation of the chain answer:
import itertools

def chain_2():
    keys, values = map(chain.from_iterable,
                       zip(*((itertools.repeat(k, len(v)), v) for k, v in d.items())))
    pd.DataFrame({'letter': list(keys), 'value': list(values)})
doesn't seem to be any faster
(python 3.7.6)
Just another one:
from collections import defaultdict

e = defaultdict(list)
for key, val in d.items():
    e["letter"] += [key] * len(val)
    e["value"] += list(val)

df = pd.DataFrame(e)
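For completeness: on pandas >= 0.25 there is built-in magic close to R's unnest, since Series.explode handles list-likes (including sets; note that set iteration order is arbitrary, so rows may come out in a different order). A minimal sketch:
import pandas as pd

d = {'a': {1, 2, 3}, 'b': {3, 4}}
df = (pd.Series(d, name='value')
        .explode()                 # one row per element, keys duplicated in the index
        .rename_axis('letter')
        .reset_index())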
I have a dictionary D that contains many dataframes.
I can access every dataframe with D[0], D[1]...D[i], with the integers as keys/identifier of the respective dataframe.
I now want to concat all the dataframes in this fashion into a new dataframe:
new_df = pd.concat([D[0],D[1],...D[i]], axis= 1)
How would you suggest I solve this (concat still needs to be used)?
I tried generating a list of the dataframes in D and passing it to concat, but received an error message.
I think the easiest thing to do is to use a list comprehension over the dict items:
In [14]:
import numpy as np
import pandas as pd

d = {'a': pd.DataFrame(np.random.randn(5, 3), columns=list('abc')),
     'b': pd.DataFrame(np.random.randn(5, 3), columns=list('def'))}
d
Out[14]:
{'a': a b c
0 0.030358 1.523752 1.040409
1 -0.220019 -1.579467 -0.312059
2 1.019489 -0.272261 1.182399
3 0.580368 1.985362 -0.835338
4 0.183974 -1.150667 1.571003, 'b': d e f
0 -0.911246 0.721034 -0.347018
1 0.483298 -0.553996 0.374566
2 -0.041415 -0.275874 -0.858687
3 0.105171 -1.509721 0.265802
4 -0.788434 0.648109 0.688839}
In [29]:
pd.concat([df for k,df in d.items()], axis=1)
Out[29]:
a b c d e f
0 0.030358 1.523752 1.040409 -0.911246 0.721034 -0.347018
1 -0.220019 -1.579467 -0.312059 0.483298 -0.553996 0.374566
2 1.019489 -0.272261 1.182399 -0.041415 -0.275874 -0.858687
3 0.580368 1.985362 -0.835338 0.105171 -1.509721 0.265802
4 0.183974 -1.150667 1.571003 -0.788434 0.648109 0.688839
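Since the keys here are the integers 0..i, a small sketch that also makes the column order deterministic by sorting the keys first (the D below is a stand-in for the question's dict):
import numpy as np
import pandas as pd

D = {0: pd.DataFrame(np.random.randn(5, 3), columns=list('abc')),
     1: pd.DataFrame(np.random.randn(5, 3), columns=list('def'))}

# concatenate side by side in key order
new_df = pd.concat([D[k] for k in sorted(D)], axis=1)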