Creating a DataFrame from elements of a dictionary - python

Below, I have a dictionary called 'date_dict'. I want to create a DataFrame that takes each key of this dictionary and has it appear in n rows of the DataFrame, where n is the key's value. For example, the date '20220107' would appear in 75910 rows. Would this be possible?
{'20220107': 75910,
'20220311': 145012,
'20220318': 214286,
'20220325': 283253,
'20220401': 351874,
'20220408': 419064,
'20220415': 486172,
'20220422': 553377,
'20220429': 620635,
'20220506': 684662,
'20220513': 748368,
'20220114': 823454,
'20220520': 886719,
'20220527': 949469,
'20220121': 1023598,
'20220128': 1096144,
'20220204': 1167590,
'20220211': 1238648,
'20220218': 1310080,
'20220225': 1380681,
'20220304': 1450031}

Maybe this could help.
import pandas as pd

myDict = {'20220107': 3, '20220311': 4, '20220318': 5}
wrkList = []
for k, v in myDict.items():
    for i in range(v):
        rowList = []
        rowList.append(k)
        wrkList.append(rowList)
df = pd.DataFrame(wrkList)
print(df)
'''
Result
0
0 20220107
1 20220107
2 20220107
3 20220311
4 20220311
5 20220311
6 20220311
7 20220318
8 20220318
9 20220318
10 20220318
11 20220318
'''
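For reference, the same frame can be built without an explicit loop. Below is a minimal sketch using numpy.repeat (the column name 'date' is my own choice, not from the question):
import pandas as pd
import numpy as np

date_dict = {'20220107': 3, '20220311': 4, '20220318': 5}
# repeat each key as many times as its value and wrap the result in a one-column DataFrame
df = pd.DataFrame({'date': np.repeat(list(date_dict.keys()), list(date_dict.values()))})
print(df)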

Related

decode pandas columns (categorical data to metric data) by template (pivot table) with constraints

In df, A and B are label-encoded categories, all belonging to a certain subset (typ).
These categories should now be encoded/decoded again ... into metric data ... taken from a template
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':   [0,1,2,3,0,1,2,3,0,2,2,2,3,3,2,3,1,1],
                   'B':   [2,3,1,1,1,3,2,2,0,2,2,2,3,3,3,3,2,1],
                   'typ': [1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9]})
A and B should be decoded to metric (float) data from the templates pivot_A and pivot_B respectively. In the templates, the headers are the values to replace, the indices are the conditions to match, and the values are the new values:
pivot_A = pd.DataFrame(np.array([np.random.rand(9), np.random.rand(9), np.random.rand(9), np.random.rand(9)]).T,
                       columns=np.unique(df.A),
                       index=np.unique(df.typ))
pivot_B = pd.DataFrame(np.array([np.random.rand(9), np.random.rand(9), np.random.rand(9), np.random.rand(9)]).T,
                       columns=np.unique(df.B),
                       index=np.unique(df.typ))
pivot_B looks like:
In [5]: pivot_B
Out[5]:
0 1 2 3
type
1 0.326687 0.851405 0.830255 0.721817
2 0.496182 0.769574 0.083379 0.491332
3 0.442760 0.786503 0.593361 0.470658
4 0.100724 0.455841 0.485407 0.211383
5 0.989424 0.852057 0.530137 0.385900
6 0.413897 0.915375 0.708038 0.846020
7 0.548033 0.670561 0.900648 0.742418
8 0.077552 0.310529 0.156794 0.076186
9 0.463480 0.377749 0.876133 0.518022
pivot_A looks like:
In [6] pivot_A
Out[6]:
0 1 2 3
type
1 0.012808 0.128041 0.001279 0.320740
2 0.615976 0.736491 0.879216 0.842910
3 0.298637 0.828012 0.962703 0.736827
4 0.700053 0.115463 0.670091 0.638931
5 0.416262 0.633604 0.504292 0.983946
6 0.956872 0.129720 0.611625 0.682046
7 0.414579 0.062104 0.118168 0.265530
8 0.162742 0.952069 0.112400 0.837696
9 0.123151 0.061040 0.326437 0.380834
Explained usage of the pivots:
if df.typ == pivot.index and df.A == X:
    df.A = pivot_A.loc[typ][X]
decoding could be done by:
for categorie in [i for i in df.columns if i != 'typ']:
    for col in np.unique(df[categorie]):
        for type_ in np.unique(df.typ):
            df.loc[((df['typ'] == type_) & (df[categorie] == col)), categorie] = locals()['pivot_{}'.format(categorie)].loc[type_, col]
and result in:
In[7] :df
Out[7]:
A B typ
0 0.012808 0.830255 1
1 0.736491 0.491332 2
2 0.962703 0.786503 3
3 0.638931 0.455841 4
4 0.416262 0.852057 5
5 0.129720 0.846020 6
6 0.118168 0.900648 7
7 0.837696 0.156794 8
8 0.123151 0.463480 9
9 0.001279 0.830255 1
10 0.879216 0.083379 2
11 0.962703 0.593361 3
12 0.638931 0.211383 4
13 0.983946 0.385900 5
14 0.611625 0.846020 6
15 0.265530 0.742418 7
16 0.952069 0.156794 8
17 0.061040 0.377749 9
BUT this looping seems NOT to be the best way of doing it, right?!
How can I improve the code? pd.replace or dictionaries seem reasonable... but I cannot figure out how to handle them with the extra typ condition.
Melting down the 3x-nested looping process to a single loop helps to reduce the run time a lot:
old_values = list(pivot_A.columns)   # from template
new_values_df = pd.DataFrame()       # to save the decoded values without overwriting the old values
for typ_ in pivot_A.index:           # to match the condition (correct typ in every loop separately)
    new_values = list(pivot_A.loc[typ_])
    new_values_df = pd.concat([(df[df['typ'] == typ_]['A']
                                .replace(old_values, new_values)).to_frame('A'),
                               new_values_df])
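For comparison, here is a loop-free sketch of the decode step. It assumes (as above) that the pivot index covers every typ and the pivot columns cover every encoded label; the decode helper is my own and not part of the original answer:
import pandas as pd

def decode(frame, col, pivot):
    # translate each typ to its row position and each encoded label to its column position,
    # then pick all matching cells in one vectorized lookup
    rows = pivot.index.get_indexer(frame['typ'])
    cols = pivot.columns.get_indexer(frame[col])
    return pivot.values[rows, cols]

df['A'] = decode(df, 'A', pivot_A)
df['B'] = decode(df, 'B', pivot_B)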

Converting a 1D list into a 2D DataFrame

I have scraped a webpage table, and the table items are in a sequential 1D list, with repeated headers. I want to reconstitute the table into a DataFrame.
I have an algorithm to do this, but I'd like to know if there is a more pythonic/efficient way to achieve this? NB. I don't necessarily know how many columns there are in my table. Here's an example:
import pandas as pd

input = ['A', 1, 'B', 5, 'C', 9,
         'A', 2, 'B', 6, 'C', 10,
         'A', 3, 'B', 7, 'C', 11,
         'A', 4, 'B', 8, 'C', 12]
output = {}
it = iter(input)
val = next(it)
while val:
    if val in output:
        output[val].append(next(it))
    else:
        output[val] = [next(it)]
    val = next(it, None)
df = pd.DataFrame(output)
print(df)
with the result:
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
If your data is always "well behaved", then something like this should suffice:
import pandas as pd

data = ['A', 1, 'B', 5, 'C', 9,
        'A', 2, 'B', 6, 'C', 10,
        'A', 3, 'B', 7, 'C', 11,
        'A', 4, 'B', 8, 'C', 12]
result = {}
for k, v in zip(data[::2], data[1::2]):
    result.setdefault(k, []).append(v)
df = pd.DataFrame(result)
You can also use numpy reshape (using the same list, here called data):
import numpy as np
cols = sorted(set(data[::2]))
df = pd.DataFrame(np.reshape(data, (int(len(data)/len(cols)/2), len(cols)*2)).T[1::2].T, columns=cols)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
Explanation:
# get columns
cols = sorted(set(data[::2]))
# reshape list into list of lists
shape = (int(len(data)/len(cols)/2), len(cols)*2)
np.reshape(data, shape)
# get only the values of the data
.T[1::2].T
# this transposes the data, slices every second row (the values), and transposes back
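Another sketch for comparison (the intermediate names pairs and row are my own): pair each header with its value, number the occurrences of each header, and pivot to wide form:
import pandas as pd

data = ['A', 1, 'B', 5, 'C', 9,
        'A', 2, 'B', 6, 'C', 10,
        'A', 3, 'B', 7, 'C', 11,
        'A', 4, 'B', 8, 'C', 12]
pairs = pd.DataFrame({'key': data[::2], 'value': data[1::2]})
pairs['row'] = pairs.groupby('key').cumcount()   # 0, 1, 2, ... within each key
df = pairs.pivot(index='row', columns='key', values='value')
print(df)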

Multiple sums on the same dataframe

I'm trying to perform multiple sums on the same dataframe and then concatenate the new dataframes into one final dataframe. Is there a concise way of doing this, or do I need to use iteration?
I have a dict of this form {key: [list_of_idx], ...} and need to groupby my dataframe for each key.
Sample data
import random
import pandas as pd

random.seed(1)
df_len = 5
df = {'idx': {i: i+1 for i in range(df_len)}, 'data': {i: random.randint(1, 11) for i in range(df_len)}}
df = pd.DataFrame(df).set_index('idx')
# Groups with the idx to groupby
groups = {'a': [1, 2, 3, 4, 5],
          'b': [1, 4],
          'c': [5]}
# I'm trying to avoid/find a faster way than this
dfs = []
for grp in groups:
    _df = df.loc[groups[grp]]
    _df['grp'] = grp
    _df = _df.groupby('grp').sum()
    dfs.append(_df)
dff = pd.concat(dfs)
Input (df)
   data  idx
0     2    1
1    10    2
2     9    3
3     3    4
4     6    5
Expected output (dff)
data
grp
a 30
c 6
b 5
Note : I'm stuck with python 2.7 and pandas 0.16.1
Time result
I tested the proposed methods and measured the execution time. I show the mean time per execution (using 1000 executions for each answer):
I couldn't test Quang Hoang's first answer because of my pandas version.
time         method
0.00696 sec  my method (question)
0.00328 sec  piRSquared (pd.concat)
0.00024 sec  piRSquared (collections and defaultdict)
0.00444 sec  Quang Hoang (2nd method: concat + reindex)
This should be (quite) a bit faster:
s = pd.Series(groups).explode()
df.reindex(s).groupby(s.index)['data'].sum()
Output:
a 30
b 5
c 6
Name: data, dtype: int64
Update: a similar approach for earlier pandas versions (Series.explode requires pandas 0.25+), although it might not be as fast
s = pd.concat([pd.DataFrame({'grp': a, 'idx': b}) for a, b in groups.items()],
              ignore_index=True).set_index('grp')
df.reindex(s.idx).groupby(s.index)['data'].sum()
Clever use of pd.concat
pd.concat({k: df.loc[v] for k, v in groups.items()}).sum(level=0)
data
a 22
b 8
c 2
NOTE: This magically works for all columns.
Suppose we have more_data
import random
random.seed(1)
df_len = 5
df = {
    'idx': {i: i+1 for i in range(df_len)},
    'data': {i: random.randint(1, 11) for i in range(df_len)},
    'more_data': {i: random.randint(1, 11) for i in range(df_len)},
}
df = pd.DataFrame(df).set_index('idx')
Then
pd.concat({k: df.loc[v] for k, v in groups.items()}).sum(level=0)
data more_data
a 22 42
b 8 19
c 2 7
But I'd stick with more Python: collections.defaultdict
from collections import defaultdict

results = defaultdict(int)
for k, V in groups.items():
    for v in V:
        results[k] += df.at[v, 'data']
pd.Series(results)
a 22
b 8
c 2
dtype: int64
For this to work with multiple columns, I have to set up the defaultdict a tad differently:
from collections import defaultdict

results = defaultdict(lambda: defaultdict(int))
for k, V in groups.items():
    for v in V:
        for c in df.columns:
            results[c][k] += df.at[v, c]
pd.DataFrame(results)
data more_data
a 22 42
b 8 19
c 2 7
This is what it would look like without defaultdict, using the setdefault method of the dict object instead.
results = {}
for k, V in groups.items():
    for v in V:
        for c in df.columns:
            results.setdefault(c, {})
            results[c].setdefault(k, 0)
            results[c][k] += df.at[v, c]
pd.DataFrame(results)
data more_data
a 22 42
b 8 19
c 2 7
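One more sketch under the same setup (the long mapping frame below is my own construction, not from the answers): turn groups into a (grp, idx) table, merge it with the data, and group once:
mapping = pd.DataFrame([(g, i) for g, idxs in groups.items() for i in idxs],
                       columns=['grp', 'idx'])
dff = mapping.merge(df.reset_index(), on='idx').groupby('grp')[['data']].sum()
print(dff)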

Check if numbers are sequential according to another column?

I have a data frame that looks like this:
Numbers Names
0 A
1 A
2 B
3 B
4 C
5 C
6 C
8 D
10 D
And my numbers (integers) need to be sequential IF the value in the column "Names" is the same for both numbers: for example, between 6 and 8 the numbers are not sequential, but that is fine since the column "Names" changes from C to D. However, between 8 and 10 this is a problem, since both rows have the same "Names" value but are not sequential.
I would like to write code that returns the missing numbers that need to be added according to the logic explained above.
import itertools as it
import pandas as pd

df = pd.read_excel("booki.xlsx")
c1 = df['Numbers'].copy()
c2 = df['Names'].copy()
for i in it.chain(range(1, len(c2)-1), range(1, len(c1)-1)):
    b = c2[i]
    c = c2[i+1]
    x = c1[i]
    n = c1[i+1]
    if c == b and n - x > 1:
        print(x+1)
It prints the missing numbers, but twice each, so for the data frame in the example it would print:
9
9
but I would like to print only:
9
Perhaps it's some failure in the logic?
Thank you
You can use groupby('Names') and then shift to get the differences between consecutive elements within each group, then pick only the ones whose difference is not -1, and print the number that follows them.
try this:
import pandas as pd
import numpy as np
from io import StringIO
df = pd.read_csv(StringIO("""
Numbers Names
0 A
1 A
2 B
3 B
4 C
5 C
6 C
8 D
10 D"""), sep="\s+")
differences = df.groupby('Names', as_index=False).apply(lambda g: g['Numbers'] - g['Numbers'].shift(-1)).fillna(-1).reset_index()
missing_numbers = (df[differences != -1]['Numbers'].dropna()+1).tolist()
print(missing_numbers)
Output:
[9.0]
I'm not sure itertools is needed here. Here is one solution using only pandas methods.
Group the data according to the Names column using groupby
Select the min and max of the Numbers column
Define an integer range from min to max
Merge this range with the sub-dataframe
Filter according to missing values using isna
Return the filtered df
Optional: reindex the columns for prettier output with reset_index
Here is the code:
df = pd.DataFrame({"Numbers": [0, 1, 2, 3, 4, 5, 6, 8, 10, 15],
"Names": ["A", "A", "B", "B", "C", "C", "C", "D", "D", "D"]})
def select_missing(df):
# Select min and max values
min_ = df.Numbers.min()
max_ = df.Numbers.max()
# Create integer range
serie = pd.DataFrame({"Numbers": [i for i in range(min_, max_ + 1)]})
# Merge with df
m = serie.merge(df, on=['Numbers'], how='left')
# Return rows not matching the equality
return m[m.isna().any(axis=1)]
# Group the data per Names and apply "select_missing" function
out = df.groupby("Names").apply(select_missing)
print(out)
# Numbers Names
# Names
# D 1 9 NaN
# 3 11 NaN
# 4 12 NaN
# 5 13 NaN
# 6 14 NaN
out = out[["Numbers"]].reset_index(level=0)
print(out)
# Names Numbers
# 1 D 9
# 3 D 11
# 4 D 12
# 5 D 13
# 6 D 14
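For reference, a compact sketch separate from the answers above: compare the full integer range against the numbers actually present, per name (the frame is rebuilt here from the question's sample so the snippet is self-contained):
import pandas as pd

df = pd.DataFrame({"Numbers": [0, 1, 2, 3, 4, 5, 6, 8, 10],
                   "Names": ["A", "A", "B", "B", "C", "C", "C", "D", "D"]})
# for each name, the set difference gives the missing numbers
missing = df.groupby('Names')['Numbers'].apply(
    lambda s: sorted(set(range(s.min(), s.max() + 1)) - set(s)))
print(missing)
# A     []
# B     []
# C     []
# D    [9]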

get a random item from a group of rows in a xlsx file in python

I have an xlsx file, for example:
A B C D E F G
1 5 2 7 0 1 8
3 4 0 7 8 5 9
4 2 9 7 0 6 2
1 6 3 2 8 8 0
4 3 5 2 5 7 9
5 2 3 2 6 9 1
being my values (which are actually in an Excel file).
I need to get random rows from it, but separated by the values in column D.
You can note that column D has values that are 7 and values that are 2.
I need to get 1 random row out of all the rows that have 7 in column D, and 1 random row out of all the rows that have 2 in column D.
And put the results in another xlsx file.
My expected output needs to be the content of line 0, 1 or 2 and the content of line 3, 4 or 5.
Can someone help me with that?
Thanks!
I've created the code for that. The code below assumes that the Excel file is named test.xlsx and resides in the same folder where you run your code. It samples NrandomLines rows for each unique value in column D and prints them out.
import pandas as pd
import numpy as np
import random

df = pd.read_excel('test.xlsx')      # read the excel
vals = df.D.unique()                 # all unique values in column D, in your case its only 2 and 7
idx = []
N = []
for i in vals:                       # loop over unique values in column D
    locs = (df.D == i).values.nonzero()[0]
    idx = idx + [locs]               # save row index of every unique value in column D
    N = N + [len(locs)]              # save how many rows contain specific value in D
NrandomLines = 1                     # how many random samples you want
for i in np.arange(len(vals)):       # loop over unique values of D
    for k in np.arange(NrandomLines):            # loop how many random samples you want
        randomRow = random.randint(0, N[i]-1)    # create random sample
        print(df.iloc[idx[i][randomRow], :])     # print out random row
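A shorter alternative sketch using pandas directly (the output file name is just a placeholder, and writing it assumes an Excel writer such as openpyxl is installed):
import pandas as pd

df = pd.read_excel('test.xlsx')   # same file assumption as above
# take one random row per unique value in column D and write the result to a new file
sampled = df.groupby('D', group_keys=False).apply(lambda g: g.sample(n=1))
sampled.to_excel('output.xlsx', index=False)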
With OpenPyXl, you can use Worksheet.iter_rows to iterate over the worksheet rows.
You can use itertools.groupby to group the rows according to the "D" column values.
To do that, you can create a small function to pick up this value from a row:
def get_d(row):
    return row[3].value
Then, you can use random.choice to choose a row randomly.
Putting it all together (note that itertools.groupby only groups consecutive items, so the rows are sorted by the key first; the file name test.xlsx and the header row are assumptions):
import itertools
import random
from openpyxl import load_workbook

ws = load_workbook('test.xlsx').active               # same file assumption as the answer above
rows = sorted(ws.iter_rows(min_row=2), key=get_d)    # skip the header row and sort by column D
for key, group in itertools.groupby(rows, key=get_d):
    row = random.choice(list(group))
    print([cell.value for cell in row])
