I have a numpy.ndarray that looks like the sample below, and I want to flatten it out so that I can manipulate it. Please find my sample data below:
sample_data=[list([{'region': 'urn:li:region:9194', 'followerCounts': {'organicFollowerCount': 157, 'paidFollowerCount': 0}}, {'region': 'urn:li:region:7127', 'followerCounts': {'organicFollowerCount': 17, 'paidFollowerCount': 0}}])]
I have tried to use the following code but no luck yet:
sample_data.flatten()
The desired output is as follows:
region              organicFollowerCount  paidFollowerCount
urn:li:region:9194  157                   0
urn:li:region:7127  17                    0
Can anyone help me achieve this, please?
Here is an approach that uses pd.json_normalize:
import pandas as pd
# note that `sample_data` has been modified into a plain list of dictionaries
sample_data = [
    {'region': 'urn:li:region:9194',
     'followerCounts': {'organicFollowerCount': 157, 'paidFollowerCount': 0}},
    {'region': 'urn:li:region:7127',
     'followerCounts': {'organicFollowerCount': 17, 'paidFollowerCount': 0}}
]
Now, convert each item in the list to a data frame:
dfs = list()
# convert one dict at a time into a data frame, using json_normalize()
for sd in sample_data:
    t = pd.json_normalize(sd)
    dfs.append(t)
# convert list of dataframes into a single data frame,
# and change column labels
t = pd.concat(dfs).rename(columns={
    'followerCounts.organicFollowerCount': 'organicFollowerCount',
    'followerCounts.paidFollowerCount': 'paidFollowerCount'
}).set_index('region')
print(t)
organicFollowerCount paidFollowerCount
region
urn:li:region:9194 157 0
urn:li:region:7127 17 0
As @thehumaneraser noted, this format is not ideal, but we can't always influence the format of the data we receive.
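As a side note, pd.json_normalize also accepts a list of records directly, so the per-item loop is optional. A minimal sketch, assuming the same reshaped sample_data as above:
import pandas as pd

# flatten the whole list of dicts in one call
t = (pd.json_normalize(sample_data)
       .rename(columns={'followerCounts.organicFollowerCount': 'organicFollowerCount',
                        'followerCounts.paidFollowerCount': 'paidFollowerCount'})
       .set_index('region'))
print(t)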
You are not going to be able to flatten this data the way you want with Numpy's flatten method. That method simply takes a multi-dimensional ndarray and flattens it to one dimension. You can read the numpy.ndarray.flatten docs for the details.
A couple of other things. First of all, your sample data above is not an ndarray; it is just a Python list. And since you call list() inside square brackets, it is actually a nested list of dictionaries. This is really not a good way to store this information, and this convoluted format leaves you very few options for nicely "flattening" it into the table you desire.
If you have many rows like this I would do the following:
import numpy as np

headers = ["region", "organicFollowerCount", "paidFollowerCount"]
data = [headers]
for row in sample_data[0]:  # subindexing here because it is unwisely a nested list
    formatted_row = []
    formatted_row.append(row["region"])
    formatted_row.append(row["followerCounts"]["organicFollowerCount"])
    formatted_row.append(row["followerCounts"]["paidFollowerCount"])
    data.append(formatted_row)
data = np.array(data)
This will give you an ndarray of the data as you have it here, but this is still an ugly solution. Really this is a highly impractical presentation of data and you should ditch it for a better one.
One last thing: don't use camel case. That is standard practice for some languages like Java, but not for Python. Instead of organicFollowerCount, use organic_follower_count, and so on.
I've read a few other posts about this but the other solutions haven't worked for me. I'm trying to look at 2 different CSV files and compare data from 1 column from each file. Here's what I have so far:
import pandas as pd
import numpy as np
dataBI = pd.read_csv("U:/eu_inventory/EO BI Orders.csv")
dataOrderTrimmed = dataBI.iloc[:,1:2].values
dataVA05 = pd.read_csv("U:/eu_inventory/VA05_Export.csv")
dataVAOrder = dataVA05.iloc[:,1:2].values
dataVAList = []
ordersInBoth = []
ordersInBI = []
ordersInVA = []
for order in np.nditer(dataOrderTrimmed):
    if order in dataVAOrder:
        ordersInBoth.append(order)
    else:
        ordersInBI.append(order)
So if the order number from dataOrderTrimmed is also in dataVAOrder, I want to add it to ordersInBoth; otherwise I want to add it to ordersInBI. I think it splits the information correctly, but if I try to print ordersInBoth, each item prints as array(5555555, dtype=int64). I want a list of the order numbers, not arrays, and without the dtype information. Let me know if you need more information or if the way I've typed it out is confusing. Thanks!
The way you're using .iloc is giving you a DataFrame, which becomes a 2D array when you access .values. If you just want the values in the column at index 1, then you should just say:
dataOrderTrimmed = dataBI.iloc[:, 1].values
Then you can iterate over dataOrderTrimmed directly (i.e. you don't need nditer), and you will get regular scalar values.
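For example, a rough sketch of the corrected loop (the file paths and column positions are copied from the question; using .tolist() and a set are my additions):
import pandas as pd

dataBI = pd.read_csv("U:/eu_inventory/EO BI Orders.csv")
dataVA05 = pd.read_csv("U:/eu_inventory/VA05_Export.csv")

dataOrderTrimmed = dataBI.iloc[:, 1].tolist()    # plain Python scalars
dataVAOrder = set(dataVA05.iloc[:, 1].tolist())  # a set makes membership tests fast

ordersInBoth = []
ordersInBI = []
for order in dataOrderTrimmed:
    if order in dataVAOrder:
        ordersInBoth.append(order)
    else:
        ordersInBI.append(order)

print(ordersInBoth)  # a list of order numbers, no array(...) or dtype noise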
I have an array with around 160k entries which I get from a CSV-file and it looks like this:
data_arr = np.array([['ID0524', 1.0],
                     ['ID0965', 2.5],
                     ...
                     ['ID0524', 6.7],
                     ['ID0324', 3.0]])
I now get around 3k unique IDs from some database, and what I have to do is look up each of these IDs in the array and sum the corresponding numbers.
So if I would need to look up "ID0524", the sum would be 7.7.
My current working code looks something like this (I'm sorry that it's pretty ugly, I'm very new to numpy):
def sumValues(self, id):
    sub_arr = data_arr[data_arr[0:data_arr.size, 0] == id]
    sum_arr = sub_arr[0:sub_arr.size, 1]
    return sum_arr.sum()
It takes around 18 s to do this for all 3k IDs.
I wondered if there is a faster way to do this, as the current runtime seems a bit too long to me. I would appreciate any guidance and hints on this. Thank you!
You could try using built-in NumPy methods (see the sketch below):
numpy.intersect1d to find the IDs present in both the array and your database list
numpy.sum to sum them up
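A minimal sketch of that idea (db_ids below is a hypothetical stand-in for the ~3k IDs from the database):
import numpy as np

# same shape as the data_arr in the question
data_arr = np.array([['ID0524', 1.0],
                     ['ID0965', 2.5],
                     ['ID0524', 6.7],
                     ['ID0324', 3.0]], dtype=object)

ids = data_arr[:, 0].astype(str)
amounts = data_arr[:, 1].astype(float)

db_ids = np.array(['ID0524', 'ID0324'])  # hypothetical stand-in for the 3k IDs
wanted = np.intersect1d(ids, db_ids)     # IDs present in both

# one boolean mask per wanted ID, summed with numpy's sum
totals = {i: amounts[ids == i].sum() for i in wanted}
print(totals)  # {'ID0324': 3.0, 'ID0524': 7.7}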
A convenient tool to do your task is Pandas, with its grouping mechanism.
Start from the necessary import:
import pandas as pd
Then convert data_arr to a pandasonic DataFrame:
df = pd.DataFrame({'Id': data_arr[:, 0], 'Amount': data_arr[:, 1].astype(float)})
The reason for some complication in the above code is that the elements of your input array are all of a single type (in this case, object), so it is necessary to convert the second column to float.
Then you can get the expected result in a single instruction:
result = df.groupby('Id').sum()
The result, for your data sample, is:
Amount
Id
ID0324 3.0
ID0524 7.7
ID0965 2.5
Another approach: you could read your CSV file directly into a DataFrame (see the read_csv method), so there is no need for any Numpy array at all.
The advantage is that read_csv is clever enough to recognize the data type of each column separately; at least it is able to tell numbers apart from strings.
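A small sketch of that variant (the file name 'orders.csv' and the assumption of a headerless two-column file are mine):
import pandas as pd

# header=None assumes the CSV has no header row; names supplies the column labels
df = pd.read_csv('orders.csv', header=None, names=['Id', 'Amount'])
result = df.groupby('Id').sum()
print(result)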
I have a dictionary of constants which needs to be multiplied with a data frame. Can anyone provide guidance on how to handle this situation or suggest an efficient data structure?
For example, the constant dictionary looks like:
dct = {'a': [0.1, 0.22, 0.13], 'b': [0.544, 0.65, 0.17], 'c': [0.13, 0.544, 0.65]}
and then I have a dataframe.
d = {'ID': ['A1','A2','A3'],'AAA':[0,0.4,0.8],'AA':[0,0.6,0.1],'A':[0,0.72,0.32],'BBB':[0,0.55,0.66]}
df2 = pd.DataFrame(data=d)
What I want to do is pick each constant from the dictionary and apply some complex function to the data frame based on a condition. Could you advise me on the data structure I should use?
I also thought about converting the data frame to a dictionary and then performing the scalar operation using zip, but that doesn't seem right.
i.e.
df2_dct = df2.set_index('ID').T.to_dict(orient='list')
for k, v in dct.items():
    for k1, v1 in df2_dct.items():
        # here I have two lists, v and v1, to perform the operation on, but this is not efficient
The operations are like: if the value is 0, ignore it; if it is less than 0.5, apply one formula; if it is greater than 0.5, apply another formula.
I'd appreciate any advice.
EDIT1:
Another idea I have is to iterate through each key/value pair of the dictionary dct, add the value list as a dataframe column, and then perform the operation. It is totally doable and fast, but then how do I store all 3 dataframes?
EDIT2:
The scalar operations are like:
temp_list = list()
for i in range(len(v1)):
    if v1[i] == 0:
        temp_list.append(v1[i])
    elif v1[i] > 0.5:
        temp_list.append(v1[i] * 4 + v[i] ** 2)  # '^' presumably means '**' here
    else:
        temp_list.append(v1[i] * 0.24 + v[i])
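A vectorised sketch of that piecewise rule using np.select (assuming v1 and v are equal-length numeric sequences, and again reading '^' as '**'):
import numpy as np

def apply_rule(v1, v):
    v1 = np.asarray(v1, dtype=float)
    v = np.asarray(v, dtype=float)
    # conditions are checked in order; anything unmatched falls through to the default
    return np.select(
        [v1 == 0, v1 > 0.5],
        [v1, v1 * 4 + v ** 2],
        default=v1 * 0.24 + v,
    )

print(apply_rule([0, 0.72, 0.32], [0.1, 0.22, 0.13]))  # [0.     2.9284 0.2068]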
EDIT3:
The expected output would be either a dictionary or a dataframe. In the case of a dictionary, it would be a nested dictionary like
op_dct = {'a': {'A1': [values], 'A2': [values], 'A3': [values]},
          'b': {'A1': [values], 'A2': [values], 'A3': [values]},
          'c': {'A1': [values], 'A2': [values], 'A3': [values]}}
so that I can access a vector like op_dct[constant_type][ID].
Multiple dataframes don't seem like the right option.
Is there a way to estimate how large a dataframe would be without loading it into memory? I already know that I do not have enough memory for the dataframe I am trying to create, but I do not know how much more memory would be required to fully create it.
You can calculate for one row, and estimate based on it:
import pandas as pd

data = {'name': ['Bill'],
        'year': [2012],
        'num_sales': [4]}
df = pd.DataFrame(data, index=['sales'])
df.memory_usage(index=True).sum()  # -> 32
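To turn that into an estimate, you could multiply the per-row figure by the expected number of rows (a rough sketch; deep=True and the row count are my assumptions, and long strings will shift the numbers):
# bytes for one representative row, counting string contents via deep=True
one_row_bytes = df.memory_usage(index=True, deep=True).sum()

expected_rows = 10_000_000            # hypothetical target row count
print(one_row_bytes * expected_rows)  # rough estimate of the full frame, in bytes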
I believe you're looking for df.memory_usage, which would tell you how much each column will occupy.
Altogether it would go something like:
df.memory_usage().sum()
Output:
123123000
You can do more specific things like including the index (index=True) or using deep=True, which will "introspect the data deeply". Feel free to check the documentation!
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.memory_usage.html
Not sure if this is a good idea after all, but having a dictionary with arrays as values, such as
import numpy as np

DF = {'z_eu': np.array([127.45064758, 150.4478288, 150.74781189, -98.3227338, -98.25155681, -98.24993753]),
      'Process': np.array(['initStep', 'Transportation', 'Transportation', 'Transportation', 'Transportation', 'phot']),
      'Creator': np.array(['SynRad', 'SynRad', 'SynRad', 'SynRad', 'SynRad', 'SynRad'])}
I need to do a selection of the numeric data (z_eu) based on values of the other two keys.
One workaround I came up with so far, was to extract the arrays and iterate through them, thereby creating another array which contains the valid data.
proc = DF['Process']; z= DF['z_eu']; creat = DF['Creator']
data = [z for z, p, c in zip(z, proc, creat) if p == 'initStep' and c == 'SynRad']
But somehow this seems to me like effort that could be completely avoided by dealing more intelligently with the dictionary in the first place. Also, the zip() takes a long time.
I know that dataframes are a valid alternative but unfortunately, since I'm dealing with strings, pandas appears to be too slow.
Any hints are most welcome!
A bit simpler: using conditional slicing, you could write
data = DF['z_eu'][(DF['Process'] == 'initStep') & (DF['Creator'] == 'SynRad')]
...or still using zip, you could simplify to
data = [z for z, p, c in zip(*DF.values()) if p == 'initStep' and c == 'SynRad']
Basically also conditional slicing, using a pandas DataFrame:
df = pd.DataFrame(DF)
data = df.loc[(df['Process'] == 'initStep') & (df['Creator'] == 'SynRad'), 'z_eu']
print(data)
# 0 127.450648
# Name: z_eu, dtype: float64
In principle I'd say there's nothing wrong with handling numpy arrays in a dict. You'll have a lot of flexibility, and sometimes operations are more efficient if you do them straight in numpy (you could even utilize numba for purely numerical, expensive calculations) - but if that is not needed and you're fine with basically an n×m table, pandas DataFrames are nice and convenient.
If your dataset is large and you want to perform many look-ups as the one shown, you might not want to perform those on strings. To improve performance, you could e.g. come up with unique IDs (integers) for each 'Process' or 'Creator' from the example. You'll just need to be able to map those back to the original strings, so keep that data as well.
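A small sketch of that integer-ID idea with np.unique(return_inverse=True), reusing the labels from the example above:
import numpy as np

# integer codes for each string, plus the label tables to map codes back
proc_labels, proc_codes = np.unique(DF['Process'], return_inverse=True)
creat_labels, creat_codes = np.unique(DF['Creator'], return_inverse=True)

# look up each label's code once, then compare integers instead of strings
init_step = np.where(proc_labels == 'initStep')[0][0]
synrad = np.where(creat_labels == 'SynRad')[0][0]

data = DF['z_eu'][(proc_codes == init_step) & (creat_codes == synrad)]
print(data)  # [127.45064758]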
You can loop through one array and, via the index, get the right elements:
z_eu = DF['z_eu']
process = DF['Process']
creator = DF['Creator']
result = []
for i in range(len(z_eu)):
    if process[i] == 'initStep' and creator[i] == 'SynRad':
        result.append(z_eu[i])
print(result)