I'm using the Hortonworks sandbox, version 2.5. The Zeppelin service is running successfully. When I create a Zeppelin notebook with sample data in a CSV file, for example the data listed below:
+-----+------+----------------+--------+-------+
| id  | name | specialization | county | state |
+-----+------+----------------+--------+-------+
| 001 | xxxx | Android        | Bronx  | NY    |
| 002 | yyyy | ROR            | Rome   | NY    |
| 003 | zzzz | Bigdata        | Bronx  | NY    |
| 004 | pppp | IOS            | Dallas | TX    |
| 005 | qqq  | IOS            | Dallas | TX    |
+-----+------+----------------+--------+-------+
I have a pie chart, a bar chart, and an SQL table. The pie chart lists the states (e.g. TX) with their respective counts.
When I click the TX slice of the pie chart, I want the data to be filtered dynamically across the entire notebook, in all widgets (SQL table, bar chart, etc.). Instead, the SQL table keeps displaying all of the data; the underlying table contains 70,000 records and I want only the TX state records.
Please tell me how I can implement this functionality in Zeppelin.
As of 0.7.0, you can create your own charts like https://github.com/1ambda/zeppelin-highcharts-columnrange.
It's called Helium (Pluggable) Visualization (Chart).
Here are some resources you can refer to:
All available helium visualizations: http://zeppelin.apache.org/helium_packages.html
How to write new helium visualization: http://zeppelin.apache.org/docs/0.7.0/development/writingzeppelinvisualization.html
Zeppelin built in samples: https://github.com/apache/zeppelin/tree/branch-0.7/zeppelin-web/src/app/visualization/builtins
Related
I am being provided with a data set and I am writing a function.
My objective is quite simple: I have an Airbnb database with various columns. I am using a for loop over a neighbourhood-group list (that I created) and I am trying to extract (append) the data related to each element into an empty dataframe.
Example:
import pandas as pd
import numpy as np
dict1 = {'id' : [2539,2595,3647,3831,12937,18198,258838,258876,267535,385824],'name':['Clean & quiet apt home by the park','Skylit Midtown Castle','THE VILLAGE OF HARLEM....NEW YORK !','Cozy Entire Floor of Brownstone','1 Stop fr. Manhattan! Private Suite,Landmark Block','Little King of Queens','Oceanview,close to Manhattan','Affordable rooms,all transportation','Home Away From Home-Room in Bronx','New York City- Riverdale Modern two bedrooms unit'],'price':[149,225,150,89,130,70,250,50,50,120],'neighbourhood_group':['Brooklyn','Manhattan','Manhattan','Brooklyn','Queens','Queens','Staten Island','Staten Island','Bronx','Bronx']}
df = pd.DataFrame(dict1)
df
I created a function as follows
nbd_grp = ['Bronx','Queens','Staten Islands','Brooklyn','Manhattan']
# Creating a function to find the cheapest place in neighbourhood group
dfdf = pd.DataFrame(columns = ['id','name','price','neighbourhood_group'])
def cheapest_place(neighbourhood_group):
    for elem in nbd_grp:
        data = df.loc[df['neighbourhood_group']==elem]
        cheapest = data.loc[data['price']==min(data['price'])]
        dfdf = cheapest.copy()
cheapest_place(nbd_grp)
My expected output is:

id      name                                 Price  neighbourhood group
267535  Home Away From Home-Room in Bronx    50     Bronx
18198   Little King of Queens                70     Queens
258876  Affordable rooms,all transportation  50     Staten Island
3831    Cozy Entire Floor of Brownstone      89     Brooklyn
3647    THE VILLAGE OF HARLEM....NEW YORK !  150    Manhattan
My advice is that anytime you are working in a database or in a dataframe and you think "I need to loop", you should think again.
When in a dataframe you are in a world of set-based logic and there is likely a better set-based way of solving the problem. In your case you can groupby() your neighbourhood_group and get the min() of the price column and then merge or join that result set back to your original dataframe to get your id and name columns.
That would look something like:
df_min_price = df.groupby('neighbourhood_group').price.agg(min).reset_index().merge(df, on=['neighbourhood_group','price'])
+-----+---------------------+-------+--------+-------------------------------------+
| idx | neighbourhood_group | price | id | name |
+-----+---------------------+-------+--------+-------------------------------------+
| 0 | Bronx | 50 | 267535 | Home Away From Home-Room in Bronx |
| 1 | Brooklyn | 89 | 3831 | Cozy Entire Floor of Brownstone |
| 2 | Manhattan | 150 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! |
| 3 | Queens | 70 | 18198 | Little King of Queens |
| 4 | Staten Island | 50 | 258876 | Affordable rooms,all transportation |
+-----+---------------------+-------+--------+-------------------------------------+
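An equivalent set-based variant (just a sketch, not part of the original answer) avoids the merge by selecting the row index of each group's minimum price directly:

# assumes the df built from dict1 above; idxmin keeps the first row on ties
idx = df.groupby('neighbourhood_group')['price'].idxmin()
df_min_price = df.loc[idx].reset_index(drop=True)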
I want to insert several different values in just one cell
E.g.
Friends' names
ID | Grade | Names
----+--------------+----------------------------
1 | elementary | Kai, Matthew, Grace
2 | guidance | Eli, Zoey, David, Nora, William
3 | High school | Emma, James, Levi, Sophia
Or as a list or dictionary:
ID | Grade | Names
----+--------------+------------------------------
1 | elementary | [Kai, Matthew, Grace]
2 | guidance | [Eli, Zoey, David, Nora, William]
3 | High school | [Emma, James, Levi, Sophia]
or
ID | Grade | Names
----+--------------+---------------------------------------------
1 | elementary | { a:Kai, b:Matthew, c:Grace}
2 | guidance | { a:Eli, b:Zoey, c:David, d:Nora, e:William}
3 | High school | { a:Emma, b:James, c:Levi, d:Sophia}
Is there a way?
Yes there is a way, but that doesn't mean you should do it this way.
You could for example save your values as a json string and save them inside the column. If you later want to add a value you can simply parse the json, add the value and put it back into the database. (Might also work with a BLOB, but I'm not sure)
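As a minimal sketch of that JSON round trip (the table and column names here are only placeholders for illustration):

import json
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE grades (id INTEGER PRIMARY KEY, grade TEXT, names TEXT)")
# store the whole list as one JSON string in a single cell
conn.execute("INSERT INTO grades VALUES (1, 'elementary', ?)",
             (json.dumps(['Kai', 'Matthew', 'Grace']),))

# later: parse the JSON, add a value, and write it back
names = json.loads(conn.execute("SELECT names FROM grades WHERE id = 1").fetchone()[0])
names.append('NewFriend')  # hypothetical new name
conn.execute("UPDATE grades SET names = ? WHERE id = 1", (json.dumps(names),))
conn.commit()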
However, I would not recommend saving a list inside of a column, as SQL is not meant to be used like that.
What I would recommend instead is a table where every grade has its own primary key, like this:
ID | Grade
---+-------------
 1 | Elementary
 2 | Guidance
 3 | High school
And then another table containing all the names, with its own primary key and the grade ID as a foreign key. E.g.:
ID | GradeID | Name
---+---------+---------
 1 |       1 | Kai
 2 |       1 | Matthew
 3 |       1 | Grace
 4 |       2 | Eli
 5 |       2 | Zoey
 6 |       2 | David
 7 |       2 | Nora
 8 |       2 | William
 9 |       3 | Emma
10 |       3 | James
11 |       3 | Levi
12 |       3 | Sophia
If you want to know more about this, you should read about Normalization in SQL.
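A rough sketch of that normalized layout, assuming SQLite (the exact table and column names are up to you):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE grade (
    id    INTEGER PRIMARY KEY,
    grade TEXT
);
CREATE TABLE name (
    id      INTEGER PRIMARY KEY,
    gradeId INTEGER REFERENCES grade(id),
    name    TEXT
);
""")
conn.executemany("INSERT INTO grade VALUES (?, ?)",
                 [(1, 'Elementary'), (2, 'Guidance'), (3, 'High school')])
conn.executemany("INSERT INTO name (gradeId, name) VALUES (?, ?)",
                 [(1, 'Kai'), (1, 'Matthew'), (1, 'Grace'), (2, 'Eli')])

# all names in one grade come back with a join instead of parsing a list
rows = conn.execute("""
    SELECT name.name FROM name
    JOIN grade ON grade.id = name.gradeId
    WHERE grade.grade = 'Elementary'
""").fetchall()
print(rows)  # [('Kai',), ('Matthew',), ('Grace',)]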
I am new to Python and Stack Overflow. I am trying to copy data from one Excel file to another Excel file using pandas and numpy.
Let's say, the first.csv contains:
ID | Title    | Country | Status | Date     | Region
---+----------+---------+--------+----------+--------
 1 | Project1 | US      | Active | 09/29/20 | America
 2 | Project2 | Brazil  | Active |          | America
 3 | Project3 | China   | Active |          | Asia
and the second.csv contains:
ID | Title    | Country | Region  | Date | Status | Description
---+----------+---------+---------+------+--------+------------
 1 | Project1 | US      | America | N/A  | Active | zzz
 4 | Project4 | Canada  | America | N/A  | Active | zzz
 5 | Project5 | Africa  | Africa  | N/A  | Active | zzz
In the second file the column Status comes after Date, whereas in the first file it comes after Country.
I want to copy the data from first.csv into second.csv, following the column structure of second.csv.
After copying, I want second.csv to look like this:
ID | Title    | Country | Region  | Date | Status | Description
---+----------+---------+---------+------+--------+-------------
 1 | Project1 | US      | America | N/A  | Active | zzz
 2 | Project2 | Brazil  | America | N/A  | Active | zzzzzzz
 3 | Project3 | China   | Asia    | N/A  | Active | zzzzzzzzzzz
 4 | Project4 | Canada  | America | N/A  | Active | zzz
 5 | Project5 | Africa  | Africa  | N/A  | Active | zzz
Is there any way to merge/copy the file in this way in Python using numpy and pandas library?
The pandas library makes this easy. Once you have both files in memory as dataframes, you can just append one to the other. The append aligns columns by name, so the result keeps the column order of the dataframe you call append on, and any column that exists in only one of the files is simply left empty (NaN) for the rows from the other file.
csv1_df = pd.read_csv('first.csv')
csv2_df = pd.read_csv('second.csv')
combined_df = csv2_df.append(csv1_df, ignore_index=True)
combined_df.to_csv('third.csv', header=True, mode='w')
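One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same column-aligning behaviour comes from pd.concat:

# equivalent on newer pandas versions; columns are still aligned by name
combined_df = pd.concat([csv2_df, csv1_df], ignore_index=True)
combined_df.to_csv('third.csv', header=True, mode='w')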
Given DF1:
Title   | Origin     | %
--------+------------+----
Analyst | Referral   |  3
Analyst | University | 10
Manager | University |  1
and DF2:
Title   | Referral | University
--------+----------+-----------
Analyst |          |
Manager |          |
I'm trying to set the values inside DF2 based on conditions such as:
DF2['Referral'] = np.where((DF1['Title']=='Analyst') & (DF1['Origin']=='Referral'), DF1['%'], '0')
What I'm getting as a result is all the values in DF1['%'], and I'm expecting to get only the value in the row where the conditions are met.
Like this:
Title   | Referral | University
--------+----------+-----------
Analyst | 3        | 10
Manager |          | 1
Also, there is probably a more efficient way of doing this, I'm open to suggestions!
just use pivot, no need for logic:
from io import StringIO
import pandas as pd

s = """Title|Origin|%
Analyst|Referral|3
Analyst|University|10
Manager|University|1"""
df = pd.read_csv(StringIO(s), sep='|')
df.pivot(index='Title', columns='Origin', values='%')
Origin Referral University
Title
Analyst 3.0 10.0
Manager NaN 1.0
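If you then want it back as a flat frame with Title as a column and blanks instead of NaN (as in your expected output), something like this should do it:

out = df.pivot(index='Title', columns='Origin', values='%').fillna('').reset_index()
out.columns.name = None  # drop the leftover 'Origin' axis label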
I have a pseudo-CSV file (pipe-separated instead of comma-separated); there are two columns. The first column header is location and has no relevance to the problem. The second of the two columns is an identifier (in this case username). The file looks something like this:
Location | Username
San Francisco, CA | sam001040
Chicago, IL | tinytom
New York City, NY | coder23
Palo Alto, CA | sam001040
As you can notice, sam001040 is seen in two cities (San Francisco & Palo Alto).
I need to assign a unique identification number to each username and create a new, similarly formatted table with the new ID number. The mappings (username -> id) should be stored to disk, because in a few days I might need to process another file and want to reuse the previously stored mappings.
So after the id process the file should look like
Location | Username | UniqueID
San Francisco, CA | sam001040 | 0
Chicago, IL | tinytom | 1
New York City, NY | coder23 | 2
Palo Alto, CA | sam001040 | 0
A few days later a file like this can come in
Location | Username
Grand Rapids, MI | gowolves
Chicago, IL | ill
Los Angeles, CA | trojans
Castro Valley, CA | coder23
Since there are some new usernames, new identifiers need to be created, while the one username we already saw last time keeps its existing identifier. So the new output file should look like this:
Location | Username | UniqueID
Grand Rapids, MI | gowolves | 3
Chicago, IL | illini | 4
Los Angeles, CA | trojans | 5
Castro Valley, CA | coder23 | 2
Here is a link to the code, there are some comments and hopefully the names are helpful, but I can clarify anything.
A couple caveats
The file I am manipulating is 1.3 GB, approximately 20,000,000 rows with about 30% duplication in usernames (translating to roughly 14,000,000 keys in the dictionary)
Currently I only have access to my local machine (MBP, 8 GB RAM, 512 GB flash storage)
Additional Info / What I've tried so far
Initially I used for loops in python, then realized that's not good practice, switched over to pandas dataframes accordingly and used lambdas
Was writing to another file, then decided to print to console and redirected to another file (using >)
Tried to process the file as a whole, which always caused something to break and once used up 500 gb of memory (don't know how that happened)
Broke up the large 1.3 gb file into 50 smaller ones, each one takes ~3 hrs to process
Tried pickling before, then switched to json to store the dictionary after reading Pickle vs. Json (link in comments)
I ran a profiler (SnakeViz) and here are the results. From my understanding it seems like checking the dictionary for keys is taking up the time, but from what I read in another Stack Overflow post, "in" is generally the fastest way to do that check (Most efficient method to check if dictionary key exists and process its value if it does)
Main Question -
Am I doing something completely wrong? I've spent the entire week looking at this and am not sure what more to do. I didn't think it would take on the order of ~150 hours to process everything.
If anyone has any suggestions or different ideas, please let me know! This is my first post, so if I need to include more info (or remove some) I apologize in advance and will adjust the post accordingly.
In general, when checking if a key is in a dictionary, do k in d, not k in d.items() which is dramatically slower, e.g.
In [68]: d = {x:x+1 for x in range(100000)}
In [69]: %timeit (67 in d)
10000000 loops, best of 3: 39.2 ns per loop
In [70]: %timeit (67 in d.items())
100 loops, best of 3: 10.8 ms per loop
That alone would make a big difference. But, I would use a pattern more like this, which should speed things up some more. .map looks up the id for existing users, and the .unique() gets the set of new usernames (filtering to those not matched in the lookup table).
df['UserId'] = df['Username'].map(segment_dict)
new_users = df[pd.isnull(df['UserId'])]['Username'].unique()
for u in new_users:
    segment_dict[u] = unique_ids
    unique_ids += 1
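Putting that pattern together with the JSON persistence you already use, one full pass could look roughly like this (the file names are placeholders, and the separator handling assumes the padded pipes shown in your sample):

import json
import pandas as pd

# load the existing username -> id mapping (empty on the first run)
try:
    with open('segment_dict.json') as f:
        segment_dict = json.load(f)
except FileNotFoundError:
    segment_dict = {}
unique_ids = max(segment_dict.values(), default=-1) + 1

df = pd.read_csv('newfile.txt', sep='|', skipinitialspace=True)
df.columns = [c.strip() for c in df.columns]
df['Username'] = df['Username'].str.strip()

# vectorised lookup for known users, then number only the unseen ones
df['UniqueID'] = df['Username'].map(segment_dict)
new_users = df[pd.isnull(df['UniqueID'])]['Username'].unique()
for u in new_users:
    segment_dict[u] = unique_ids
    unique_ids += 1
df['UniqueID'] = df['Username'].map(segment_dict).astype(int)

with open('segment_dict.json', 'w') as f:
    json.dump(segment_dict, f)
df.to_csv('newfile_with_ids.txt', sep='|', index=False)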
You could try keeping the User -> ID mapping in CSV for use in pandas.
Assuming you have a CSV file mapping known usernames to IDs:
$ cat ids.csv
sam001040,0
tinytom,1
coder23,2
And a new file newfile.txt that you need to process:
$ cat newfile.txt
Location | Username
Grand Rapids, MI | gowolves
Chicago, IL | ill
Los Angeles, CA | trojans
Castro Valley, CA | coder23
You read in ids.csv:
ids = pd.read_csv('ids.csv', header=None, index_col=0, names=['Username', 'ID'])
and newfile.txt:
newfile = pd.read_csv('newfile.txt', sep=' \| ', skipinitialspace=True)
# or pd.read_csv('newfile.txt', sep='|'), which is faster, but won't work nice
# when the file has spaces like you show
Now you can do:
newfile_with_ids = newfile.merge(ids, left_on='Username', right_index=True, how='left')
All known IDs are already filled in:
            Location  Username   ID
0   Grand Rapids, MI  gowolves  NaN
1        Chicago, IL       ill  NaN
2    Los Angeles, CA   trojans  NaN
3  Castro Valley, CA   coder23    2
Now, add new IDs:
mask = newfile_with_ids['ID'].isnull()
ids = pd.concat([ids, pd.DataFrame(
    data={'ID': 1 + int(ids.iloc[-1]) + np.arange(mask.sum())},
    index=newfile_with_ids.loc[mask, 'Username'].drop_duplicates())])
to get:
           ID
Username
sam001040   0
tinytom     1
coder23     2
gowolves    3
ill         4
trojans     5
Then write new IDs to the dataframe:
newfile_with_ids.loc[mask, 'ID'] = ids.loc[
    newfile_with_ids.loc[mask, 'Username'], 'ID'].values
And finally you have:
            Location  Username  ID
3  Castro Valley, CA   coder23   2
0   Grand Rapids, MI  gowolves   3
1        Chicago, IL       ill   4
2    Los Angeles, CA   trojans   5
Finally, save the new ids back and continue.
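Given how ids was read in above (Username as the index, ID as the only column), that last save could be as simple as:

# writes plain "username,id" lines, matching the original ids.csv format
ids.to_csv('ids.csv', header=False)
newfile_with_ids.to_csv('output.txt', sep='|', index=False)  # 'output.txt' is a placeholder name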