Array Manipulation to DataFrame - python

I have the following array:
(array([[5.8205872e+07, 2.0200601e+07, 1.6700000e+02, 2.1500000e+02,
5.0000000e+01, 5.0000000e+00],
[5.7929117e+07, 2.0200601e+07, 1.6700000e+02, 1.5000000e+02,
5.0000000e+01, 5.0000000e+00],
[5.8178782e+07, 2.0200601e+07, 1.6700000e+02, 1.5750000e+02,
5.0000000e+01, 5.0000000e+00],
[5.7936230e+07, 2.0210228e+07, 1.6700000e+02, 1.8000000e+02,
4.0000000e+01, 5.0000000e+00],
[5.8213574e+07, 2.0210228e+07, 1.6700000e+02, 6.9500000e+02,
4.0000000e+01, 5.0000000e+00],
[2.5693916e+07, 2.0210228e+07, 1.6700000e+02, 4.8518000e+02,
4.0000000e+01, 5.0000000e+00]]),
array([[ 0.46666667, 7.16666667],
[ 0.51724138, 5.17241379],
[ 0.73333333, 5.25 ],
[ 0.34285714, 5.14285714],
[ 1.18918919, 18.78378378],
[ 1.26315789, 12.76789474]]))
I would like to transform it to a data frame that has 8 columns and six rows in total.
I tried pd.DataFrame(my_array), but the result has just two rows, like this:
0 [[58205872.0, 20200601.0, 167.0, 30.0, 1.0, 10...
1 [[0.4666666666666667, 7.166666666666667], [0.5...
How can I achieve what is described above?

It looks like you want to concatenate your two arrays (my_array is in fact a tuple of two arrays) and then turn the result into a DataFrame. You could first use numpy.hstack:
>>> your_two_arrays = (..., ...)
>>> a = np.hstack(your_two_arrays)
>>> a.shape
(6, 8)
and finally pandas.DataFrame
>>> pd.DataFrame(data=a)
0 1 2 3 4 5 6 7
0 58205872.0 20200601.0 167.0 215.00 50.0 5.0 0.466667 7.166667
1 57929117.0 20200601.0 167.0 150.00 50.0 5.0 0.517241 5.172414
2 58178782.0 20200601.0 167.0 157.50 50.0 5.0 0.733333 5.250000
3 57936230.0 20210228.0 167.0 180.00 40.0 5.0 0.342857 5.142857
4 58213574.0 20210228.0 167.0 695.00 40.0 5.0 1.189189 18.783784
5 25693916.0 20210228.0 167.0 485.18 40.0 5.0 1.263158 12.767895
[...] the result is just two rows like this: [...]
The data you passed to pd.DataFrame when calling pd.DataFrame(my_array) was a tuple of two objects. Hence the two rows (and one column) you got: one row per array.
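As a minimal sketch of the same idea on toy data (the small arrays and the column names "a", "b" and "ratio" are made up for illustration, they are not from the question), stacking the tuple column-wise before building the frame keeps one row per observation:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for my_array: a tuple holding a (3, 2) and a (3, 1) array.
left = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
right = np.array([[0.1], [0.2], [0.3]])
my_array = (left, right)

# hstack joins the arrays side by side, so row i of each array ends up in row i.
combined = np.hstack(my_array)

# The column names are illustrative only.
df = pd.DataFrame(combined, columns=["a", "b", "ratio"])
print(df.shape)
```

Both arrays must have the same number of rows for np.hstack to work here.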


It is not clear where MultiIndex positions start, or maybe I do not understand it well

I need to understand slicing with a MultiIndex. For example, given:
health_data
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year visit
2013 1 31.0 38.7 32.0 36.7 35.0 37.2
2 44.0 37.7 50.0 35.0 29.0 36.7
2014 1 30.0 37.4 39.0 37.8 61.0 36.9
2 47.0 37.8 48.0 37.3 51.0 36.5
And by doing the following command:
health_data.iloc[:2, :2]
I get back:
subject Bob
type HR Temp
year visit
2013 1 31.0 38.7
2 44.0 37.7
Can anybody please tell me why the result looks like this? Where does positional indexing start in a multi-indexed DataFrame?
If I interpret your data correctly, we can rebuild your df as follows, using pd.MultiIndex.from_tuples and the regular df constructor pd.DataFrame, to gain some clarity about its structure:
import pandas as pd
import numpy as np
tuples_columns = [('Bob', 'HR'), ('Bob', 'Temp'), ('Guido', 'HR'),
('Guido', 'Temp'), ('Sue', 'HR'), ('Sue', 'Temp')]
columns = pd.MultiIndex.from_tuples(tuples_columns, names=['subject', 'type'])
tuples_index = [(2013, 1), (2013, 2), (2014, 1), (2014, 2)]
index = pd.MultiIndex.from_tuples(tuples_index, names=['year', 'visit'])
data = np.array([[31. , 38.7, 32. , 36.7, 35. , 37.2],
[44. , 37.7, 50. , 35. , 29. , 36.7],
[30. , 37.4, 39. , 37.8, 61. , 36.9],
[47. , 37.8, 48. , 37.3, 51. , 36.5]])
health_data = pd.DataFrame(data=data, columns=columns, index=index)
print(health_data)
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year visit
2013 1 31.0 38.7 32.0 36.7 35.0 37.2
2 44.0 37.7 50.0 35.0 29.0 36.7
2014 1 30.0 37.4 39.0 37.8 61.0 36.9
2 47.0 37.8 48.0 37.3 51.0 36.5
As you can see from this snippet, both your columns and index are MultiIndices with names for each level (2 levels for both: 0 and 1), which we find in the top left corner of the print. N.B. The names are not part of the columns/index in the sense that you cannot use them directly to select from the df. E.g. your columns start with (Bob, HR), not with subject and/or type. You can of course select the names if you want to:
print(health_data.columns.names)
['subject', 'type']
Or indeed, you can also reset them to None values, in which case they will disappear, without otherwise affecting the structure of your df:
health_data.columns.names = [None, None]
health_data.index.names = [None, None]
print(health_data)
Bob Guido Sue
HR Temp HR Temp HR Temp
2013 1 31.0 38.7 32.0 36.7 35.0 37.2
2 44.0 37.7 50.0 35.0 29.0 36.7
2014 1 30.0 37.4 39.0 37.8 61.0 36.9
2 47.0 37.8 48.0 37.3 51.0 36.5
The other confusing thing is probably that the values from the first level (0) are not repeated: they become blanks when they appear as duplicates. Not to worry, they are still there. This is just done to provide a better sense of the relation between the different levels. E.g. your actual index values look like this:
print(health_data.index)
MultiIndex([(2013, 1),
(2013, 2),
(2014, 1),
(2014, 2)],
names=['year', 'visit'])
But since 2013 occurs in both (2013, 1), (2013, 2), this is displayed as if they are (2013, 1), ('', 2). When you get used to this notation, it is actually much easier to see, e.g. that you just have two years (2013, 2014) with two sub levels (i.e. visit) for each: 1, 2.
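To check that the blanked-out values really are still there, here is a small sketch on a cut-down frame (a single made-up "HR" column, not your full data). It reads level 0 directly and also turns off the sparse display through the pandas option display.multi_sparse:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(2013, 1), (2013, 2), (2014, 1), (2014, 2)], names=["year", "visit"]
)
# Cut-down, illustrative frame with a single column.
df = pd.DataFrame({"HR": [31.0, 44.0, 30.0, 47.0]}, index=index)

# Level 0 holds one value per row, even though print() blanks out the repeats.
years = df.index.get_level_values("year").tolist()
print(years)

# With multi_sparse disabled, every row is printed with its full index tuple.
with pd.option_context("display.multi_sparse", False):
    dense_repr = repr(df)
print(dense_repr)
```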
Lastly, let's review your df.iloc example:
health_data.iloc[:2, :2]
subject Bob
type HR Temp
year visit
2013 1 31.0 38.7
2 44.0 37.7
We can see now how this works: we are selecting :2 from the index (so: 0, 1) and same for the columns. subject and type are just the names for the columns, year and visit just the names for the index, while Bob and 2013 are not repeated in the respective levels 0 of both MultiIndices since they are duplicates.
Suppose we want to select the same data using df.loc, we could do this as follows:
health_data.loc[[(2013,1),(2013,2)], [('Bob','HR'),('Bob','Temp')]]
# same result
Or, perhaps more conveniently, we make use of index.get_level_values, and do something like this:
health_data.loc[health_data.index.get_level_values(0) == 2013,
health_data.columns.get_level_values(0) == 'Bob']
# same result
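If you want to convince yourself that the three selections agree, a sketch along these lines should do it. The frame below uses placeholder values from np.arange rather than your measurements, and the variable names by_position, by_label and by_mask are just for illustration:

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["Bob", "Guido", "Sue"], ["HR", "Temp"]], names=["subject", "type"]
)
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=["year", "visit"])
# Placeholder values 0..23, only the layout matters here.
data = np.arange(24, dtype=float).reshape(4, 6)
health_data = pd.DataFrame(data, index=index, columns=columns)

# Positional: first two rows, first two columns (positions start at 0).
by_position = health_data.iloc[:2, :2]

# Label based: full tuples for both axes.
by_label = health_data.loc[[(2013, 1), (2013, 2)], [("Bob", "HR"), ("Bob", "Temp")]]

# Boolean masks built from the level-0 values.
by_mask = health_data.loc[
    health_data.index.get_level_values(0) == 2013,
    health_data.columns.get_level_values(0) == "Bob",
]

print(np.array_equal(by_position.to_numpy(), by_label.to_numpy()))
print(np.array_equal(by_position.to_numpy(), by_mask.to_numpy()))
```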

Create new columns based on previous columns with multiplication

I want to create a list of columns where the new columns are based on previous columns times 1.5. It will roll until Year 2020. I tried to use previous and current but it didn't work as expected. How can I make it work as expected?
df = pd.DataFrame({
'us2000':[5,3,6,9,2,4],
}); df
a = []
for i in range(1, 21):
a.append("us202" + str(i))
for previous, current in zip(a, a[1:]):
df[current] = df[previous] * 1.5
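IIUC, the original loop builds names such as 'us2021', 'us2022', ..., 'us20220', none of which exist in df, so df[previous] fails on the first iteration. You can fix your code by zero-padding the year suffix and starting at 0: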
a = []
for i in range(0, 21):
a.append(f'us20{i:02}')
for previous, current in zip(a, a[1:]):
df[current] = df[previous] * 1.5
Another, vectorized, approach with numpy would be:
df2 = (pd.DataFrame(df['us2000'].to_numpy()[:,None]*1.5**np.arange(21),
columns=[f'us20{i:02}' for i in range(21)]))
output:
us2000 us2001 us2002 us2003 us2004 us2005 us2006 us2007 ...
0 5 7.5 11.25 16.875 25.3125 37.96875 56.953125 85.429688
1 3 4.5 6.75 10.125 15.1875 22.78125 34.171875 51.257812
2 6 9.0 13.50 20.250 30.3750 45.56250 68.343750 102.515625
3 9 13.5 20.25 30.375 45.5625 68.34375 102.515625 153.773438
4 2 3.0 4.50 6.750 10.1250 15.18750 22.781250 34.171875
5 4 6.0 9.00 13.500 20.2500 30.37500 45.562500 68.343750
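As a quick sanity check, the sketch below runs both variants side by side on the sample column and compares them. The names looped, vectorized and factors are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"us2000": [5, 3, 6, 9, 2, 4]})
names = [f"us20{i:02d}" for i in range(21)]  # us2000 ... us2020

# Loop version: each column is the previous one times 1.5.
looped = df.copy()
for previous, current in zip(names, names[1:]):
    looped[current] = looped[previous] * 1.5

# Broadcast version: column k equals us2000 * 1.5**k.
factors = 1.5 ** np.arange(21)
vectorized = pd.DataFrame(
    df["us2000"].to_numpy()[:, None] * factors, columns=names
)

print(np.allclose(looped.to_numpy(dtype=float), vectorized.to_numpy()))
```

The one visible difference is the dtype of us2000: the loop leaves it as an integer column, while the broadcast version produces floats throughout.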
Try:
for i in range(1, 21):
df[f"us{int(2000+i):2d}"] = df[f"us{int(2000+i-1):2d}"].mul(1.5)
>>> df
us2000 us2001 us2002 ... us2018 us2019 us2020
0 5 7.5 11.25 ... 7389.45940 11084.18910 16626.283650
1 3 4.5 6.75 ... 4433.67564 6650.51346 9975.770190
2 6 9.0 13.50 ... 8867.35128 13301.02692 19951.540380
3 9 13.5 20.25 ... 13301.02692 19951.54038 29927.310571
4 2 3.0 4.50 ... 2955.78376 4433.67564 6650.513460
5 4 6.0 9.00 ... 5911.56752 8867.35128 13301.026920
[6 rows x 21 columns]
pd.DataFrame(df.to_numpy()*[1.5**i for i in range(0,21)])\
.rename(columns=lambda x:str(x).rjust(2,'0')).add_prefix("us20")
Output:
us2000 us2001 us2002 ... us2018 us2019 us2020
0 5 7.5 11.25 ... 7389.45940 11084.18910 16626.283650
1 3 4.5 6.75 ... 4433.67564 6650.51346 9975.770190
2 6 9.0 13.50 ... 8867.35128 13301.02692 19951.540380
3 9 13.5 20.25 ... 13301.02692 19951.54038 29927.310571
4 2 3.0 4.50 ... 2955.78376 4433.67564 6650.513460
5 4 6.0 9.00 ... 5911.56752 8867.35128 13301.026920
[6 rows x 21 columns]

Cartesian product of all items for a given group using pandas

I am starting with a DataFrame that looks like this:
id tof
0 43.0 1999991.0
1 43.0 2095230.0
2 43.0 4123105.0
3 43.0 5560423.0
4 46.0 2098996.0
5 46.0 2114971.0
6 46.0 4130033.0
7 46.0 4355096.0
8 82.0 2055207.0
9 82.0 2093996.0
10 82.0 4193587.0
11 90.0 2059360.0
12 90.0 2083762.0
13 90.0 2648235.0
14 90.0 4212177.0
15 103.0 1993306.0
.
.
.
and ultimately my goal is to create a very long two dimensional array that contains all combinations of items with the same id like this (for rows with id 43):
[(1993306.0, 2105441.0), (1993306.0, 3972679.0), (1993306.0, 3992558.0), (1993306.0, 4009044.0), (2105441.0, 3972679.0), (2105441.0, 3992558.0), (2105441.0, 4009044.0), (3972679.0, 3992558.0), (3972679.0, 4009044.0), (3992558.0, 4009044.0),...]
except changing all the tuples to arrays so that I could transpose the array after iterating over all id numbers.
Naturally, itertools came to mind, and my first thought was doing something with df.groupby('id') so that it would apply itertools internally to every group with the same id, but I would guess that this would take absolutely forever with the million line datafiles I have.
Is there a vectorized way to do this?
IIUC:
from itertools import combinations
pd.DataFrame([
[k, c0, c1] for k, tof in df.groupby('id').tof
for c0, c1 in combinations(tof, 2)
], columns=['id', 'tof0', 'tof1'])
id tof0 tof1
0 43.0 1999991.0 2095230.0
1 43.0 1999991.0 4123105.0
2 43.0 1999991.0 5560423.0
3 43.0 2095230.0 4123105.0
4 43.0 2095230.0 5560423.0
5 43.0 4123105.0 5560423.0
6 46.0 2098996.0 2114971.0
7 46.0 2098996.0 4130033.0
8 46.0 2098996.0 4355096.0
9 46.0 2114971.0 4130033.0
10 46.0 2114971.0 4355096.0
11 46.0 4130033.0 4355096.0
12 82.0 2055207.0 2093996.0
13 82.0 2055207.0 4193587.0
14 82.0 2093996.0 4193587.0
15 90.0 2059360.0 2083762.0
16 90.0 2059360.0 2648235.0
17 90.0 2059360.0 4212177.0
18 90.0 2083762.0 2648235.0
19 90.0 2083762.0 4212177.0
20 90.0 2648235.0 4212177.0
Explanation
This is a list comprehension that builds a list of lists, which is then passed to the DataFrame constructor. See the Python documentation on list comprehensions for more detail.
from itertools import combinations
pd.DataFrame([
# name series of tof values
# ↓ ↓
[k, c0, c1] for k, tof in df.groupby('id').tof
# items from combinations
# first second
# ↓ ↓
for c0, c1 in combinations(tof, 2)
], columns=['id', 'tof0', 'tof1'])
from itertools import product
x = df[df.id == 13].tof.values.astype(float)
all_combinations = list(product(x,x))
If you'd prefer that elements don't repeat, you can use:
from itertools import combinations
x = df[df.id == 13].tof.values.astype(float)
all_combinations = list(combinations(x,2))
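To make the difference between the two concrete, here is a tiny sketch on a made-up three-element list (not taken from your data):

```python
from itertools import combinations, product

x = [1.0, 2.0, 3.0]  # illustrative values

# product pairs every element with every element, including itself and both orders.
all_pairs = list(product(x, x))

# combinations keeps each unordered pair once and never pairs an element with itself.
unique_pairs = list(combinations(x, 2))

print(len(all_pairs), len(unique_pairs))
```

For n values, product gives n * n pairs and combinations gives n * (n - 1) / 2, which matters a lot on large groups.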
Groupby does work:
def get_product(x):
return pd.MultiIndex.from_product((x.tof, x.tof)).values
for i, g in df.groupby('id'):
print(i, get_product(g))
Output:
43.0 [(1999991.0, 1999991.0) (1999991.0, 2095230.0) (1999991.0, 4123105.0)
(1999991.0, 5560423.0) (2095230.0, 1999991.0) (2095230.0, 2095230.0)
(2095230.0, 4123105.0) (2095230.0, 5560423.0) (4123105.0, 1999991.0)
(4123105.0, 2095230.0) (4123105.0, 4123105.0) (4123105.0, 5560423.0)
(5560423.0, 1999991.0) (5560423.0, 2095230.0) (5560423.0, 4123105.0)
(5560423.0, 5560423.0)]
46.0 [(2098996.0, 2098996.0) (2098996.0, 2114971.0) (2098996.0, 4130033.0)
(2098996.0, 4355096.0) (2114971.0, 2098996.0) (2114971.0, 2114971.0)
(2114971.0, 4130033.0) (2114971.0, 4355096.0) (4130033.0, 2098996.0)
(4130033.0, 2114971.0) (4130033.0, 4130033.0) (4130033.0, 4355096.0)
(4355096.0, 2098996.0) (4355096.0, 2114971.0) (4355096.0, 4130033.0)
(4355096.0, 4355096.0)]
82.0 [(2055207.0, 2055207.0) (2055207.0, 2093996.0) (2055207.0, 4193587.0)
(2093996.0, 2055207.0) (2093996.0, 2093996.0) (2093996.0, 4193587.0)
(4193587.0, 2055207.0) (4193587.0, 2093996.0) (4193587.0, 4193587.0)]
90.0 [(2059360.0, 2059360.0) (2059360.0, 2083762.0) (2059360.0, 2648235.0)
(2059360.0, 4212177.0) (2083762.0, 2059360.0) (2083762.0, 2083762.0)
(2083762.0, 2648235.0) (2083762.0, 4212177.0) (2648235.0, 2059360.0)
(2648235.0, 2083762.0) (2648235.0, 2648235.0) (2648235.0, 4212177.0)
(4212177.0, 2059360.0) (4212177.0, 2083762.0) (4212177.0, 2648235.0)
(4212177.0, 4212177.0)]
103.0 [(1993306.0, 1993306.0)]
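Since the question asked for arrays rather than tuples, one possible sketch is to build the pairs per group with np.triu_indices instead of itertools. This is a different technique from the answers above, it still loops over the groups (only the pairing inside each group is vectorized), and the sample frame is just the first and third groups from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [43.0] * 4 + [82.0] * 3,
    "tof": [1999991.0, 2095230.0, 4123105.0, 5560423.0,
            2055207.0, 2093996.0, 4193587.0],
})

pairs = []
for _, tof in df.groupby("id")["tof"]:
    values = tof.to_numpy()
    # Index pairs (i, j) with i < j, i.e. every unordered pair exactly once.
    i, j = np.triu_indices(len(values), k=1)
    pairs.append(np.column_stack([values[i], values[j]]))

all_pairs = np.vstack(pairs)   # shape (n_pairs, 2)
transposed = all_pairs.T       # shape (2, n_pairs)
print(all_pairs.shape, transposed.shape)
```

Whether this is fast enough on million-line files depends on the group sizes, since the number of pairs grows quadratically within each group.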

Filter one DataFrame by unique values in another DataFrame

I have two pandas DataFrames.
The first DataFrame contains all of the imported review data, with columns such as "prodcode", "sentiment", "summaryText", "reviewText", etc.
DFF = DFF[['prodcode', 'summaryText', 'reviewText', 'overall', 'reviewerID', 'reviewerName', 'helpful','reviewTime', 'unixReviewTime', 'sentiment','textLength']]
which produces:
prodcode summaryText reviewText overall reviewerID ... helpful reviewTime unixReviewTime sentiment textLength
0 B00002243X Work Well - Should Have Bought Longer Ones I needed a set of jumper cables for my new car... 5.0 A3F73SC1LY51OO ... [4, 4] 08 17, 2011 1313539200 2 516
1 B00002243X Okay long cables These long cables work fine for my truck, but ... 4.0 A20S66SKYXULG2 ... [1, 1] 09 4, 2011 1315094400 2 265
2 B00002243X Looks and feels heavy Duty Can't comment much on these since they have no... 5.0 A2I8LFSN2IS5EO ... [0, 0] 07 25, 2013 1374710400 2 1142
3 B00002243X Excellent choice for Jumper Cables!!! I absolutley love Amazon!!! For the price of ... 5.0 A3GT2EWQSO45ZG ... [19, 19] 12 21, 2010 1292889600 2 4739
4 B00002243X Excellent, High Quality Starter Cables I purchased the 12' feet long cable set and th... 5.0 A3ESWJPAVRPWB4 ... [0, 0] 07 4, 2012 1341360000 2 415
The second DataFrame groups the reviews by prodcode and holds, for each product, the share of reviews with each sentiment score out of all reviews made for that product.
df1 = (
DFF.groupby(["prodcode", "sentiment"]).count()
.join(DFF.groupby("prodcode").count(), "prodcode", rsuffix="_r"))[['reviewText', 'reviewText_r']]
df1['result'] = df1['reviewText']/df1['reviewText_r']
df1 = df1.reset_index()
df1 = df1.pivot("prodcode", 'sentiment', 'result').fillna(0)
df1 = round(df1 * 100)
df1.astype('int')
sorted_df2 = df1.sort_values(['0', '1', '2'], ascending=False)
which produces the following DF:
sentiment 0 1 2
prodcode
B0024E6QOO 80.0 0.0 20.0
B000GPV2QA 67.0 17.0 17.0
B0067DNSUI 67.0 0.0 33.0
B00192JH4S 62.0 12.0 25.0
B0087FSA0C 60.0 20.0 20.0
B0002KM5L0 60.0 0.0 40.0
B000DZBP60 60.0 0.0 40.0
B000PJCBOE 60.0 0.0 40.0
B0033A5PPO 57.0 29.0 14.0
B003POL69C 57.0 14.0 29.0
B0002Z9L8K 56.0 31.0 12.0
What I am now trying to do is filter my first dataframe in two ways. First, by the results of the second dataframe: I want the first dataframe to be filtered by the prodcodes from the second dataframe where df1.sentiment['0'] > 40. Then, from that list, I want to keep only those rows where 'sentiment' in the first dataframe = 0.
At a high level, I am trying to obtain the prodcode, summaryText and reviewText in the first dataframe for Products that had high ratios in lower sentiment scores, and whose sentiment is 0.
Something like this, assuming all the data you need is in one dataframe (here df) and no merges are needed:
m = list(DFF['prodcode'].loc[DFF['sentiment'] == 0])  # prodcodes that have at least one review with sentiment 0
df.loc[(df['0'] > 40) & (df['prodcode'].isin(m))]  # keep rows above the threshold whose prodcode is in that list
I figured it out:
DF3 = pd.merge(DFF, df1, left_on='prodcode', right_on='prodcode')
print(DF3.loc[(DF3['0'] > 50.0) & (DF3['2'] < 50.0) & (DF3['sentiment'].isin(['0']))].sort_values('0', ascending=False))
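For comparison, here is a sketch that avoids the merge by filtering DFF with isin. It swaps in pd.crosstab with normalize="index" to compute the per-product percentages, which is not the groupby/join/pivot from the question, and it runs on a made-up six-row frame with integer sentiment labels, so treat the names and the threshold as assumptions:

```python
import pandas as pd

# Made-up sample: three products, sentiment coded as integers.
DFF = pd.DataFrame({
    "prodcode": ["A", "A", "A", "B", "B", "C"],
    "sentiment": [0, 0, 2, 0, 2, 2],
    "summaryText": ["bad", "poor", "great", "meh", "good", "fine"],
})

# Share of each sentiment per product, in percent (each row sums to 100).
ratios = pd.crosstab(DFF["prodcode"], DFF["sentiment"], normalize="index") * 100

# Products where more than 40% of the reviews have sentiment 0.
low_products = ratios.index[ratios[0] > 40]

# Keep only the sentiment-0 reviews of those products.
filtered = DFF[DFF["prodcode"].isin(low_products) & (DFF["sentiment"] == 0)]
print(filtered[["prodcode", "summaryText"]])
```

If your sentiment labels are strings rather than integers, the column lookup and the comparison would need '0' instead of 0.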

Populating new DataFrame by multi-criteria selection from old one with different structure

I'm using Pandas for data analysis. I have an input file like this snippet:
VEH SEC POS ACCELL SPEED
2 8.4 36.51 -0.2929 27.39
3 8.4 23.57 -0.7381 33.09
4 8.4 6.18 0.6164 38.8
1 8.5 47.76 0 25.57
I need to reorganize the data so that the rows are the unique (ordered) values from SEC as the 1st column, and then the other columns would be VEH1_POS, VEH1_SPEED, VEH1_ACCELL, VEH2_POS, VEH2_SPEED, VEH2_ACCELL, etc.:
TIME VEH1_POS VEH1_SPEED VEH1_ACCEL VEH2_POS, VEH2_SPEED, etc.
0.1 6.2 3.7 0.0 7.5 2.1
0.2 6.8 3.2 -0.5 8.3 2.1
etc.
So, for example, the value for VEH1_POS for each row in the new dataframe would be filled in by selecting values from the POS column in the original dataframe using the row where the SEC value matches the TIME value for the row in the new dataframe and the VEH value == 1.
To set up the rows in the new data frame I'm doing this:
start = inputdf['SIMSEC'].min()
end = inputdf['SIMSEC'].max()
time_steps = frange(start, end, 0.1)
outputdf['TIME'] = time_steps
But I'm lost at how to select the right values from the input dataframe and create the rest of the new dataframe for further analysis. Note also that the input file will NOT have data for every VEH for every SEC (time stamp). So the solution needs to handle that as well. My best guess was:
outputdf['veh1_pos'] = np.where((inputdf['VEH NO'] == 1) & (inputdf['SIMSEC'] == row['Time Step']))
but that doesn't work.
import pandas as pd
# your data
# ==========================
print(df)
Out[272]:
VEH SEC POS ACCELL SPEED
0 2 8.4 36.51 -0.2929 27.39
1 3 8.4 23.57 -0.7381 33.09
2 4 8.4 6.18 0.6164 38.80
3 1 8.5 47.76 0.0000 25.57
# reshaping
# ==========================
result = df.set_index(['SEC','VEH']).unstack()
Out[278]:
POS ACCELL SPEED
VEH 1 2 3 4 1 2 3 4 1 2 3 4
SEC
8.4 NaN 36.51 23.57 6.18 NaN -0.2929 -0.7381 0.6164 NaN 27.39 33.09 38.8
8.5 47.76 NaN NaN NaN 0 NaN NaN NaN 25.57 NaN NaN NaN
So here, the columns form a MultiIndex whose first level is POS, ACCELL, SPEED and whose second level is VEH = 1, 2, 3, 4. Combinations with no data in the input (e.g. VEH 1 at SEC 8.4) show up as NaN.
# if you want to rename the column
temp_z = result.columns.get_level_values(0)
temp_y = result.columns.get_level_values(1)
temp_x = ['VEH'] * len(temp_y)
result.columns = ['{}{}_{}'.format(x,y,z) for x,y,z in zip(temp_x, temp_y, temp_z)]
Out[298]:
VEH1_POS VEH2_POS VEH3_POS VEH4_POS VEH1_ACCELL VEH2_ACCELL VEH3_ACCELL VEH4_ACCELL VEH1_SPEED VEH2_SPEED VEH3_SPEED VEH4_SPEED
SEC
8.4 NaN 36.51 23.57 6.18 NaN -0.2929 -0.7381 0.6164 NaN 27.39 33.09 38.8
8.5 47.76 NaN NaN NaN 0 NaN NaN NaN 25.57 NaN NaN NaN
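Putting the steps together, a self-contained sketch on the four sample rows might look like this. Turning SEC into a TIME column via rename_axis and reset_index is an assumption about the layout you want, and the name wide is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "VEH": [2, 3, 4, 1],
    "SEC": [8.4, 8.4, 8.4, 8.5],
    "POS": [36.51, 23.57, 6.18, 47.76],
    "ACCELL": [-0.2929, -0.7381, 0.6164, 0.0],
    "SPEED": [27.39, 33.09, 38.8, 25.57],
})

# One row per SEC, one column per (measure, VEH) pair; missing pairs become NaN.
wide = df.set_index(["SEC", "VEH"]).unstack()

# Flatten the two column levels into names like VEH1_POS.
wide.columns = [f"VEH{veh}_{measure}" for measure, veh in wide.columns]

# Assumption: expose SEC as an ordinary column called TIME.
wide = wide.rename_axis("TIME").reset_index()
print(wide)
```

Note that unstack raises an error if the same (SEC, VEH) pair appears more than once in the input, so duplicates would need to be aggregated first.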
