I have a dataframe of 51 rows and 464 columns; the columns contain 1s and 0s. I want an encoded hex value, as you can see in the attached picture.
I was trying to use numpy to make the hex conversion, but it would fail:
df = pd.DataFrame(np.random.randint(0, 2, size=(51, 464)))
# converting into numpy for easier shifting
a = df.values
b = a.dot(2**np.arange(a.size)[::-1])
I want every 4 columns grouped to produce one hexadecimal digit, and if the column count is not a multiple of 4 (for example 463 instead of 464), then the trailing hexadecimal digit should be padded with as many zeroes as needed to make a full hexadecimal value.
This code only works up to 64 bits in length and then fails.
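I suspect this is integer overflow: with int64 weights, 2**k silently wraps around once k exceeds 63. A quick sketch of the problem:
import numpy as np
weights = 2 ** np.arange(70)  # int64: entries from index 63 on wrap around
print(weights[62:66])         # a power of two, then wrapped garbage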
I was following this example:
binary0|1 to hex string
Any suggestions on how to do this?
Doesn't this do what you want?
df.apply(lambda row: hex(int(''.join(map(str, row)), base=2)), axis=1)
Convert every number in the row to a string
Join them to create one big binary number as a string
Convert it to an integer with base 2 (since the row is in binary format)
Convert it to hex
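For illustration, a minimal sketch of those steps on a toy 8-bit row:
row = [1, 0, 1, 1, 0, 0, 1, 0]  # a toy row
bits = ''.join(map(str, row))   # steps 1-2: '10110010'
value = int(bits, base=2)       # step 3: 178
print(hex(value))               # step 4: '0xb2'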
Edit: To convert every 4-bit group in the same manner:
def hexize(row):
    hexes = '0x'
    row = ''.join(map(str, row))
    for i in range(0, len(row), 4):
        value = row[i:i+4]
        value = value.ljust(4, '0')  # right-fill with 0
        value = hex(int(value, base=2))
        hexes += value[2:]
    return hexes

df.apply(hexize, axis=1)
hexize('011101100') # returns '0x760'
Given input data:
ECID,T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,T11,T12,T13,T14,T15,T16,T17,T18,T19,T20,T21,T22,T23,T24,T25,T26,T27,T28,T29,T30,T31,T32,T33,T34,T35,T36,T37,T38,T39,T40,T41,T42,T43,T44,T45,T46,T47,T48,T49,T50,T51
ABC123,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
XYZ345,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DEF789,1,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
434thECID,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
This adds an "Encoded" column similar to what was asked. The first row example in the original question seems to have the wrong number of Fs:
import pandas as pd

def encode(row):
    s = ''.join(str(x) for x in row[1:])  # create the binary string
    s += '0' * (-len(s) % 4)              # pad with zeros to a multiple of 4
    i = int(s, 2)                         # convert to integer, base 2
    h = hex(i).rstrip('0')                # strip trailing zeros
    return h if h != '0x' else '0x0'      # an all-zero row strips to '0x'

df = pd.read_csv('input.csv')
df['Encoded'] = df.apply(encode, axis=1)
print(df)
Output:
ECID T1 T2 T3 T4 T5 ... T47 T48 T49 T50 T51 Encoded
0 ABC123 1 1 1 1 1 ... 1 1 1 1 1 0xffffffffffffe
1 XYZ345 1 0 0 0 0 ... 0 0 0 0 0 0x8
2 DEF789 1 0 1 0 1 ... 0 0 0 0 0 0xaa
3 434thECID 0 0 0 0 0 ... 0 0 0 0 0 0x0
[4 rows x 53 columns]
I have scraped a webpage table, and the table items are in a sequential 1D list, with repeated headers. I want to reconstitute the table into a DataFrame.
I have an algorithm to do this, but I'd like to know if there is a more pythonic/efficient way to achieve this? NB. I don't necessarily know how many columns there are in my table. Here's an example:
input = ['A', 1, 'B', 5, 'C', 9,
         'A', 2, 'B', 6, 'C', 10,
         'A', 3, 'B', 7, 'C', 11,
         'A', 4, 'B', 8, 'C', 12]
output = {}
it = iter(input)
val = next(it)
while val:
    if val in output:
        output[val].append(next(it))
    else:
        output[val] = [next(it)]
    val = next(it, None)
df = pd.DataFrame(output)
print(df)
with the result:
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
If your data is always "well behaved", then something like this should suffice:
import pandas as pd

data = ['A', 1, 'B', 5, 'C', 9,
        'A', 2, 'B', 6, 'C', 10,
        'A', 3, 'B', 7, 'C', 11,
        'A', 4, 'B', 8, 'C', 12]

result = {}
for k, v in zip(data[::2], data[1::2]):
    result.setdefault(k, []).append(v)
df = pd.DataFrame(result)
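Equivalently, a sketch of the same idea with collections.defaultdict, if you prefer it over setdefault:
from collections import defaultdict
import pandas as pd

data = ['A', 1, 'B', 5, 'C', 9,
        'A', 2, 'B', 6, 'C', 10]

result = defaultdict(list)
for k, v in zip(data[::2], data[1::2]):  # pair each header with the value that follows it
    result[k].append(v)
df = pd.DataFrame(result)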
You can also use numpy reshape:
import numpy as np

l = data  # the flat input list from above
cols = sorted(set(l[::2]))
df = pd.DataFrame(np.reshape(l, (len(l) // (len(cols) * 2), len(cols) * 2)).T[1::2].T,
                  columns=cols)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
Explanation:
# get the column names from every second element
cols = sorted(set(l[::2]))
# reshape the flat list into one row per record
shape = (len(l) // (len(cols) * 2), len(cols) * 2)
np.reshape(l, shape)
# keep only the values: transpose, slice every second row, transpose back
.T[1::2].T
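Putting the pieces together as a runnable sketch (assuming l is the flat list from the question):
import numpy as np
import pandas as pd

l = ['A', 1, 'B', 5, 'C', 9,
     'A', 2, 'B', 6, 'C', 10,
     'A', 3, 'B', 7, 'C', 11,
     'A', 4, 'B', 8, 'C', 12]

cols = sorted(set(l[::2]))                          # ['A', 'B', 'C']
shape = (len(l) // (len(cols) * 2), len(cols) * 2)  # (4, 6): one row per record
values = np.reshape(l, shape).T[1::2].T             # drop the header elements
df = pd.DataFrame(values, columns=cols)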
I have a column with land dimensions in Pandas. It looks like this:
df.LotSizeDimensions.value_counts(dropna=False)
40.00X150.00 2
57.00X130.00 2
27.00X117.00 2
63.00X135.00 2
37.00X108.00 2
65.00X134.00 2
57.00X116.00 2
33x124x67x31x20x118 1
55.00X160.00 1
63.00X126.00 1
36.00X105.50 1
In rows where there is only one X, I would like to create a separate column that multiplies the two values. In rows where there is more than one X, I would like to return a zero. This is the code I came up with:
def dimensions_split(df: pd.DataFrame):
    df.LotSizeDimensions = df.LotSizeDimensions.str.strip()
    df.LotSizeDimensions = df.LotSizeDimensions.str.upper()
    df.LotSizeDimensions = df.LotSizeDimensions.str.strip('`"M')
    if df.LotSizeDimensions.count('X') > 1:
        return 0
    df['LotSize'] = map(int(df.LotSizeDimensions.str.split("X", 1).str[0]) * int(df.LotSizeDimensions.str.split("X", 1).str[1]))
This is coming back with the following error:
TypeError: cannot convert the series to <class 'int'>
I would also like to add a line where if there are any non-numeric characters other than X, return a zero.
The idea is to first strip and uppercase the LotSizeDimensions column into a Series, then use Series.str.split to expand it into a DataFrame, and then multiply the two columns where there is exactly one X, returning 0 otherwise:
import numpy as np

s = df.LotSizeDimensions.str.strip('`"M ').str.upper()
df1 = s.str.split('X', expand=True).astype(float)
# for general data with non-numeric pieces:
# df1 = s.str.split('X', expand=True).apply(lambda x: pd.to_numeric(x, errors='coerce'))
df['LotSize'] = np.where(s.str.count('X').eq(1), df1[0] * df1[1], 0)
print(df)
LotSizeDimensions LotSize
0 40.00X150.00 6000.0
1 57.00X130.00 7410.0
2 27.00X117.00 3159.0
3 37.00X108.00 3996.0
4 63.00X135.00 8505.0
5 65.00X134.00 8710.0
6 57.00X116.00 6612.0
7 33x124x67x31x20x118 0.0
8 55.00X160.00 8800.0
9 63.00X126.00 7938.0
10 36.00X105.50 3798.0
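To also return zero for rows containing non-numeric characters other than X (the extra requirement in the question), the commented pd.to_numeric line above coerces such pieces to NaN; a sketch of finishing the job, assuming the same df:
import pandas as pd

s = df.LotSizeDimensions.str.strip('`"M ').str.upper()
df1 = s.str.split('X', expand=True).apply(pd.to_numeric, errors='coerce')
# keep the product only where there is exactly one X, then zero out rows
# whose pieces failed to parse as numbers
df['LotSize'] = (df1[0] * df1[1]).where(s.str.count('X').eq(1), 0).fillna(0)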
I get this using a list comprehension:
import pandas as pd

df = pd.DataFrame(['40.00X150.00',
                   '57.00X130.00',
                   '27.00X117.00',
                   '37.00X108.00',
                   '63.00X135.00',
                   '65.00X134.00',
                   '57.00X116.00',
                   '33x124x67x31x20x118',
                   '55.00X160.00',
                   '63.00X126.00',
                   '36.00X105.50'])
df[1] = [float(parts[0]) * float(parts[1]) if len(parts) == 2 else None
         for parts in (str_data.strip().split("X") for str_data in df[0])]
I am trying to change the values of a very long column (about 1 million entries) in a data frame. I have something like
####ID_Orig
3452
3452
3452
6543
6543
...
I want something like
####ID_new
0
0
0
1
1
...
At the moment I'm doing this:
j = 0
for i in range(0, 1199531):
    if data.ID_orig[i] == data.ID_orig[i+1]:
        data.ID_orig[i] = j
    else:
        data.ID_orig[i] = j
        j = j + 1
This takes ages... Is there a faster way to do it?
I don't know what values ID_orig has or how often a single value comes up.
Use factorize; note that if a group of values repeats later, the repeated group gets the same number.
Another solution, comparing shifted values with ne (!=) and taking the cumsum, is more general: it always creates a new number, even when a group of values repeats:
df['ID_new1'] = pd.factorize(df['ID_Orig'])[0]
df['ID_new2'] = df['ID_Orig'].ne(df['ID_Orig'].shift()).cumsum() - 1
print (df)
ID_Orig ID_new1 ID_new2
0 3452 0 0
1 3452 0 0
2 3452 0 0
3 6543 1 1
4 6543 1 1
5 100 2 2
6 100 2 2
7 6543 1 3 <-repeating group
8 6543 1 3 <-repeating group
You can do this:
import collections

l1 = [3452, 3452, 3452, 6543, 6543]
c = collections.Counter(l1)
l2 = list(c.items())
l3 = []
for i, t in enumerate(l2):
    for x in range(t[1]):
        l3.append(i)
for x in l3:
    print(x)
This is the output:
0
0
0
1
1
You can use the following. In this implementation, duplicate ids in the original column get the same new id. It is based on dropping the duplicates from the column, assigning a different number to each unique id to form the new ids, and then merging the new ids back into the original dataset.
import numpy as np
import pandas as pd
from time import time
num_rows = 119953
input_data = np.random.randint(1199531, size=(num_rows,1))
data = pd.DataFrame(input_data)
data.columns = ["ID_orig"]
data2 = pd.DataFrame(input_data)
data2.columns = ["ID_orig"]
t0 = time()
j = 0
for i in range(0, num_rows-1):
    if data.ID_orig[i] == data.ID_orig[i+1]:
        data.ID_orig[i] = j
    else:
        data.ID_orig[i] = j
        j = j + 1
t1 = time()
id_new = data2.loc[:,"ID_orig"].drop_duplicates().reset_index().drop("index", axis=1)
id_new.reset_index(inplace=True)
id_new.columns = ["id_new"] + id_new.columns[1:].values.tolist()
data2 = data2.merge(id_new, on="ID_orig")
t2 = time()
print("Previous: ", round(t1-t0, 2), " seconds")
print("Current : ", round(t2-t1, 2), " seconds")
The output of the above program, using only ~120k rows, is:
Previous: 12.16 seconds
Current : 0.06 seconds
The runtime difference increases even more as the number of rows increases.
EDIT
Using the same number of rows:
>>> print("Previous: ", round(t1-t0, 2))
Previous: 11.7
>>> print("Current : ", round(t2-t1, 2))
Current : 0.06
>>> print("jezrael's answer : ", round(t3-t2, 2))
jezrael's answer : 0.02
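For completeness, a sketch of how the t3 timing referenced above could be measured, assuming pd.factorize from jezrael's answer is appended after the t2 measurement:
# appended after t2 = time() in the script above
data2["ID_new"] = pd.factorize(data2["ID_orig"])[0]
t3 = time()
print("jezrael's answer : ", round(t3 - t2, 2))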