Converting a 1D list into a 2D DataFrame - python

I have scraped a webpage table, and the table items came back as a flat 1D list with repeated headers. I want to reconstitute the table into a DataFrame.
I have an algorithm that does this, but I'd like to know if there is a more Pythonic/efficient way to achieve it. NB: I don't necessarily know in advance how many columns my table has. Here's an example:
import pandas as pd

input = ['A',1,'B',5,'C',9,
         'A',2,'B',6,'C',10,
         'A',3,'B',7,'C',11,
         'A',4,'B',8,'C',12]
output = {}
it = iter(input)
val = next(it)
while val:
    if val in output:
        output[val].append(next(it))
    else:
        output[val] = [next(it)]
    val = next(it, None)
df = pd.DataFrame(output)
print(df)
with the result:
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12

If your data is always "well behaved", then something like this should suffice:
import pandas as pd

data = ['A',1,'B',5,'C',9,
        'A',2,'B',6,'C',10,
        'A',3,'B',7,'C',11,
        'A',4,'B',8,'C',12]
result = {}
for k, v in zip(data[::2], data[1::2]):
    result.setdefault(k, []).append(v)
df = pd.DataFrame(result)

You can also use numpy's reshape (here l is the flat input list from the question):
import numpy as np
import pandas as pd

cols = sorted(set(l[::2]))
shape = (len(l) // (2 * len(cols)), len(cols) * 2)
df = pd.DataFrame(np.reshape(l, shape).T[1::2].T, columns=cols)
with the result:
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
Explanation:
# get the column names: every second item, de-duplicated
cols = sorted(set(l[::2]))
# reshape the flat list into one row per table row (header, value, header, value, ...)
shape = (len(l) // (2 * len(cols)), len(cols) * 2)
np.reshape(l, shape)
# keep only the values: transpose, take every second row, transpose back
.T[1::2].T
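Another sketch of my own (not from the answers above): since the headers repeat once per table row, you can pair each header with its value and build one dict per row. This also works when the column count is unknown in advance:

```python
import pandas as pd

data = ['A', 1, 'B', 5, 'C', 9,
        'A', 2, 'B', 6, 'C', 10,
        'A', 3, 'B', 7, 'C', 11,
        'A', 4, 'B', 8, 'C', 12]

# pair each header with its value
pairs = list(zip(data[::2], data[1::2]))
# the number of distinct headers is the column count
ncols = len(dict(pairs))
# slice the pairs into one dict per table row
rows = [dict(pairs[i:i + ncols]) for i in range(0, len(pairs), ncols)]
df = pd.DataFrame(rows)
print(df)
```

This assumes, like the other answers, that the headers repeat in the same order on every row.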

Related

Creating a DataFrame from elements of a dictionary

Below, I have a dictionary called 'date_dict'. I want to create a DataFrame in which each key of this dictionary appears in n rows, where n is the key's value. For example, the date '20220107' would appear in 75910 rows. Would this be possible?
{'20220107': 75910,
'20220311': 145012,
'20220318': 214286,
'20220325': 283253,
'20220401': 351874,
'20220408': 419064,
'20220415': 486172,
'20220422': 553377,
'20220429': 620635,
'20220506': 684662,
'20220513': 748368,
'20220114': 823454,
'20220520': 886719,
'20220527': 949469,
'20220121': 1023598,
'20220128': 1096144,
'20220204': 1167590,
'20220211': 1238648,
'20220218': 1310080,
'20220225': 1380681,
'20220304': 1450031}
Maybe this could help.
import pandas as pd

myDict = {'20220107': 3, '20220311': 4, '20220318': 5}
wrkList = []
for k, v in myDict.items():
    for i in range(v):
        wrkList.append([k])
df = pd.DataFrame(wrkList)
print(df)
'''
R e s u l t
0
0 20220107
1 20220107
2 20220107
3 20220311
4 20220311
5 20220311
6 20220311
7 20220318
8 20220318
9 20220318
10 20220318
11 20220318
'''
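A loop-free sketch of the same idea (my own variant): Index.repeat repeats each key by its count in one vectorized call, which matters at the ~75k-row scale in the question:

```python
import pandas as pd

myDict = {'20220107': 3, '20220311': 4, '20220318': 5}
s = pd.Series(myDict)
# repeat each key (the index) according to its value
df = pd.DataFrame({'date': s.index.repeat(s.values)})
print(df)
```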

Check if numbers are sequential according to another column?

I have a data frame that looks like this:
Numbers Names
0 A
1 A
2 B
3 B
4 C
5 C
6 C
8 D
10 D
My numbers (integers) need to be sequential IF the value in the "Names" column is the same for both rows. For example, between 6 and 8 the numbers are not sequential, but that is fine because "Names" changes from C to D. Between 8 and 10, however, there is a problem: both rows have the same "Names" value but the numbers are not sequential.
I would like to do a code that returns the numbers missing that need to be added according to the logic explained above.
import itertools as it
import pandas as pd

df = pd.read_excel("booki.xlsx")
c1 = df['Numbers'].copy()
c2 = df['Names'].copy()
for i in it.chain(range(1, len(c2)-1), range(1, len(c1)-1)):
    b = c2[i]
    c = c2[i+1]
    x = c1[i]
    n = c1[i+1]
    if c == b and n - x > 1:
        print(x+1)
It prints the numbers that are missing but two times, so for the data frame in the example it would print:
9
9
but I would like to print only:
9
Perhaps it's some failure in the logic?
Thank you
(The duplicated output in your code comes from itertools.chain: it walks the same index range twice, once built from c2 and once from c1, and those ranges are identical.) You can use groupby('Names') and then shift to get the differences between consecutive elements within each group, then pick only the rows whose difference is not -1 and print the number that follows them.
try this:
import pandas as pd
import numpy as np
from io import StringIO
df = pd.read_csv(StringIO("""
Numbers Names
0 A
1 A
2 B
3 B
4 C
5 C
6 C
8 D
10 D"""), sep=r"\s+")
differences = df.groupby('Names', as_index=False).apply(lambda g: g['Numbers'] - g['Numbers'].shift(-1)).fillna(-1).reset_index()
missing_numbers = (df[differences != -1]['Numbers'].dropna()+1).tolist()
print(missing_numbers)
Output:
[9.0]
I'm not sure itertools is needed here. Here is a solution using only pandas methods:
Group the data by the Names column using groupby
Select the min and max of the Numbers column
Define an integer range from min to max
merge this range with the sub-dataframe
Filter for missing values using isna
Return the filtered df
Optional: reset_index for a prettier output
Here the code:
import pandas as pd

df = pd.DataFrame({"Numbers": [0, 1, 2, 3, 4, 5, 6, 8, 10, 15],
                   "Names": ["A", "A", "B", "B", "C", "C", "C", "D", "D", "D"]})

def select_missing(df):
    # Select min and max values
    min_ = df.Numbers.min()
    max_ = df.Numbers.max()
    # Create integer range
    serie = pd.DataFrame({"Numbers": list(range(min_, max_ + 1))})
    # Merge with df
    m = serie.merge(df, on=['Numbers'], how='left')
    # Return rows not matching the equality
    return m[m.isna().any(axis=1)]

# Group the data per Names and apply "select_missing" function
out = df.groupby("Names").apply(select_missing)
print(out)
# Numbers Names
# Names
# D 1 9 NaN
# 3 11 NaN
# 4 12 NaN
# 5 13 NaN
# 6 14 NaN
out = out[["Numbers"]].reset_index(level=0)
print(out)
# Names Numbers
# 1 D 9
# 3 D 11
# 4 D 12
# 5 D 13
# 6 D 14
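A more compact variant (my own sketch, not one of the answers above): within each group, the missing numbers are simply the set difference between the full integer range and the observed values:

```python
import pandas as pd

df = pd.DataFrame({"Numbers": [0, 1, 2, 3, 4, 5, 6, 8, 10],
                   "Names": ["A", "A", "B", "B", "C", "C", "C", "D", "D"]})

# per group: full range minus the observed numbers = the missing ones
missing = df.groupby("Names")["Numbers"].apply(
    lambda s: sorted(set(range(s.min(), s.max() + 1)) - set(s)))
print(missing)
```

The result is a Series of lists keyed by Names; only group D is non-empty here.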

Populate pandas dataframe using column and row indices as variables

Overview
How do you populate a pandas DataFrame using arithmetic on the row and column indices?
Setup
import pandas as pd
import numpy as np
df = pd.DataFrame(index = range(5), columns = ['Combo_Class0', 'Combo_Class1', 'Combo_Class2', 'Combo_Class3', 'Combo_Class4'])
Objective
Each cell in df = row index * (column index + 2)
Attempt 1
You can use this solution to produce the following code:
row = 0
for i in range(5):
    row = row + 1
    df.loc[i] = [(row)*(1+2), (row)*(2+2), (row)*(3+2), (row)*(4+2), (row)*(5+2)]
Attempt 2
This solution seemed relevant as well, although I believe I've read that you're not supposed to loop through dataframes. Besides, I don't see how to loop through both rows and columns:
for i, j in df.iterrows():
    df.loc[i] = i
You can leverage broadcasting for a more efficient approach:
ix = (df.index+1).to_numpy() # use .values on pandas < 0.24
df[:] = ix[:,None] * (ix+2)
print(df)
Combo_Class0 Combo_Class1 Combo_Class2 Combo_Class3 Combo_Class4
0 3 4 5 6 7
1 6 8 10 12 14
2 9 12 15 18 21
3 12 16 20 24 28
4 15 20 25 30 35
Using multiply outer
df[:]=np.multiply.outer((np.arange(5)+1),(np.arange(5)+3))
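The same table can also be sketched with np.fromfunction, which hands the row and column index arrays straight to the formula (a variant of mine, using the same 1-based row / 3-based column convention as the answers above):

```python
import numpy as np
import pandas as pd

cols = [f'Combo_Class{i}' for i in range(5)]
# cell(i, j) = (i + 1) * (j + 3), matching the broadcasting answer above
vals = np.fromfunction(lambda i, j: (i + 1) * (j + 3), (5, 5))
df = pd.DataFrame(vals.astype(int), columns=cols)
print(df)
```

np.fromfunction returns a float array by default, hence the astype(int).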

get a random item from a group of rows in a xlsx file in python

I have a xlsx file, for example:
A B C D E F G
1 5 2 7 0 1 8
3 4 0 7 8 5 9
4 2 9 7 0 6 2
1 6 3 2 8 8 0
4 3 5 2 5 7 9
5 2 3 2 6 9 1
being my values (that are actually on an excel file).
I need to get random rows from it, separated by the values in column D.
You can see that column D has some rows with the value 7 and some with the value 2.
I need one random row from all the rows that have 7 in column D, and one random row from all the rows that have 2 in column D.
And put the results on another xlsx file.
My expected output needs to be the content of line 0, 1 or 2 and the content of line 3, 4 or 5.
Can someone help me with that?
Thanks!
I've created the code for that. The code below assumes that the excel file is named test.xlsx and resides in the same folder where you run your code. It samples NrandomLines rows from each unique value in column D and prints them out.
import pandas as pd
import numpy as np
import random

df = pd.read_excel('test.xlsx')  # read the excel file
vals = df.D.unique()  # all unique values in column D; in your case only 2 and 7
idx = []
N = []
for i in vals:  # loop over unique values in column D
    locs = (df.D == i).values.nonzero()[0]
    idx = idx + [locs]  # save the row indices of every unique value in column D
    N = N + [len(locs)]  # save how many rows contain each value of D
NrandomLines = 1  # how many random samples you want
for i in np.arange(len(vals)):  # loop over unique values of D
    for k in np.arange(NrandomLines):  # loop over how many random samples you want
        randomRow = random.randint(0, N[i]-1)  # pick a random sample
        print(df.iloc[idx[i][randomRow], :])  # print out the random row
With OpenPyXL, you can use Worksheet.iter_rows to iterate the worksheet rows.
You can use itertools.groupby to group the rows according to the "D" column values (note that groupby needs its input sorted by the same key).
To do that, you can create a small function to pick up this value in a row:
def get_d(row):
    return row[3].value
Then, you can use random.choice to choose a row randomly.
Putting it all together:
import itertools
import random
from openpyxl import load_workbook

ws = load_workbook('test.xlsx').active

def get_d(row):
    return row[3].value

rows = sorted(ws.iter_rows(min_row=2), key=get_d)  # sort so groupby sees each key once
for key, group in itertools.groupby(rows, key=get_d):
    row = random.choice(list(group))
    print([cell.value for cell in row])
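As a newer-pandas sketch (my own, assuming pandas >= 1.1 for GroupBy.sample): one random row per distinct value of D in a single call, shown here on a subset of the question's columns:

```python
import pandas as pd

# a subset of the example table, with column D as in the question
df = pd.DataFrame({'A': [1, 3, 4, 1, 4, 5],
                   'B': [5, 4, 2, 6, 3, 2],
                   'D': [7, 7, 7, 2, 2, 2]})

# one random row per distinct value in column D
sampled = df.groupby('D', group_keys=False).sample(n=1)
print(sampled)
# sampled.to_excel('output.xlsx', index=False)  # write the result back out
```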

How do I convert a row from a pandas DataFrame from a Series back to a DataFrame?

I am iterating through the rows of a pandas DataFrame, expanding each one out into N rows with additional info on each one (for simplicity I've made it a random number here):
from pandas import DataFrame
import pandas as pd
from numpy import random, arange

N = 3
x = DataFrame.from_dict({'farm': ['A','B','A','B'],
                         'fruit': ['apple','apple','pear','pear']})
out = DataFrame()
for i, row in x.iterrows():
    rows = pd.concat([row]*N).reset_index(drop=True)  # requires row to be a DataFrame
    out = out.append(rows.join(DataFrame({'iter': arange(N), 'value': random.uniform(size=N)})))
In this loop, row is a Series object, so the call to pd.concat doesn't work. How do I convert it to a DataFrame? (Eg. the difference between x.ix[0:0] and x.ix[0])
Thanks!
Given what you commented, I would try
def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

results = x.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
This should give you a separate result dataframe. I have assumed that every farm-fruit combination is unique... there might be other ways, if we'd know more about your data.
Update
Running code example
import pandas as pd
from numpy import random, arange

def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

N = 3
df = pd.DataFrame(arange(0, 8).reshape(4, 2), columns=['low', 'high'])
df['farm'] = 'a'
df['fruit'] = arange(0, 4)
results = df.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
df
low high farm fruit
0 0 1 a 0
1 2 3 a 1
2 4 5 a 2
3 6 7 a 3
results
farm fruit
a 0 [0.176124290969, 0.459726835079, 0.999564934689]
1 [2.42920143009, 2.37484506501, 2.41474002256]
2 [4.78918572452, 4.25916442343, 4.77440617104]
3 [6.53831891152, 6.23242754976, 6.75141668088]
If instead you want a dataframe, you can update the function to
def giveMeSomeRows(group):
    return pd.DataFrame(random.uniform(low=group.low, high=group.high, size=N))
results
0
farm fruit
a 0 0 0.281088
1 0.020348
2 0.986269
1 0 2.642676
1 2.194996
2 2.650600
2 0 4.545718
1 4.486054
2 4.027336
3 0 6.550892
1 6.363941
2 6.702316
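To answer the title question directly (a small sketch of my own): each row yielded by iterrows is a Series, and to_frame().T turns it back into a one-row DataFrame:

```python
import pandas as pd

x = pd.DataFrame({'farm': ['A', 'B'], 'fruit': ['apple', 'pear']})
row = x.iloc[0]            # a single row, as a Series
row_df = row.to_frame().T  # back to a one-row DataFrame
print(row_df)
```

This is the modern equivalent of the x.ix[0:0] slicing mentioned in the question (ix has since been removed from pandas).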
