This is my customer dataset:
ID  Address                                             Location  Age
1   room no5, rose apt, hill street, nagpur-"500249"    Nagpur    38
2   block C-4, kirti park, Thane-"400356"               Thane     26
3   Dream villa, noor nagar, Colaba-"400008"                      46
5   Sita Apt, Room no 101, Kalyan-"421509"              Kalyan    55
7   Rose Apt, room no 20, Antop hill, Mumbai-"400007"   Mumbai    43
8   Dirk Villa, nouis park, Dadar-"400079"              Mumbai    50
9   Raj apt, room no-2, powai, "400076"                           33
As shown above, I have Address column values and corresponding values for the Location column. Given training data (in which all Address and Location values are present) and test data (which contains only Address values, for which I want to predict Location), I want to forecast which Location each Address falls under.
For this problem, I want to know which approach in Python would suit best. Here are some references I came across, but I don't know which to follow:
link_1
link_2
Any suggestion would be much appreciated. Thanks.
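One common starting point for this kind of problem is to treat it as short-text classification: vectorize the address strings and fit a classifier on the labelled rows. Below is a minimal sketch under that assumption, using made-up frames modelled on the sample above (the pipeline choice is mine, not from the question; character n-grams are used because addresses have inconsistent spelling and punctuation):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data shaped like the sample table
train = pd.DataFrame({
    "Address": ['room no5, rose apt, hill street, nagpur-"500249"',
                'block C-4, kirti park, Thane-"400356"',
                'Sita Apt, Room no 101, Kalyan-"421509"',
                'Rose Apt, room no 20, Antop hill, Mumbai-"400007"',
                'Dirk Villa, nouis park, Dadar-"400079"'],
    "Location": ["Nagpur", "Thane", "Kalyan", "Mumbai", "Mumbai"],
})

# Character n-gram TF-IDF copes with typos and inconsistent formatting
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), lowercase=True),
    LogisticRegression(max_iter=1000),
)
model.fit(train["Address"], train["Location"])

# Test data: addresses only, Location to be predicted
test_addresses = ['flat 2, rose apt, nagpur-"500249"']
predicted = model.predict(test_addresses)
```

With realistic volumes of data you would also hold out a validation split to compare this against the approaches in the linked references.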
Mathew Jim 60
Gerry Hagger 61
Sam Page 23
Azli Muzan 52
David Agor 32
James Paine 40
Mike Gregor 63
Howard Jack 56
I have this data in a CSV file, with name and age together in one column, and I would like a Python script that separates the names from the numbers and creates a new CSV with the individuals' heights classified as short, medium, or tall:
Height Classification
<5 short
5-6 medium
6> tall
I would like two column headings: Names and Classification.
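A minimal sketch of one way to do this follows. Note an assumption: the sample rows show ages, so this treats the trailing number on each line as the height to classify (the name/number split works the same either way); the input is read here from an in-memory string standing in for the CSV file, and the `classify` helper and file names are hypothetical:

```python
import csv
import io

def classify(height):
    # Thresholds from the question: <5 short, 5-6 medium, >6 tall
    if height < 5:
        return "short"
    if height <= 6:
        return "medium"
    return "tall"

# Stand-in for the single-column CSV: "Name Height" per line (hypothetical values)
raw = io.StringIO("Mathew Jim 5.8\nSam Page 4.9\nMike Gregor 6.3\n")

rows = []
for line in raw:
    *name_parts, value = line.split()  # last token is the number, the rest is the name
    rows.append({"Names": " ".join(name_parts),
                 "Classification": classify(float(value))})

# Write the new CSV with the two requested headings
with open("classified.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Names", "Classification"])
    writer.writeheader()
    writer.writerows(rows)
```

Replace the `io.StringIO` stand-in with `open("yourfile.csv")` to run it against the real file.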
I have three datasets:
dataset 1
Customer1 Customer2 Exposures + other columns
Nick McKenzie Christopher Mill 23450
Nick McKenzie Stephen Green 23450
Johnny Craston Mary Shane 12
Johnny Craston Stephen Green 12
Molly John Casey Step 1000021
dataset2 (unique Customers: Customer 1 + Customer 2)
Customer Age
Nick McKenzie 53
Johnny Craston 75
Molly John 34
Christopher Mill 63
Stephen Green 65
Mary Shane 54
Casey Step 34
Mick Sale
dataset 3
Customer1 Customer2 Exposures + other columns
Mick Sale Johnny Craston
Mick Sale Stephen Green
Exposures refers to Customer1 only.
Other columns are omitted for brevity. Dataset 2 is built from the unique values of Customer1 and Customer2; it contains no duplicates. Dataset 3 has the same columns as dataset 1.
I'd like to add the information from dataset 1 into dataset 2 to have
Final dataset
Customer Age Exposures + other columns
Nick McKenzie 53 23450
Johnny Craston 75 12
Molly John 34 1000021
Christopher Mill 63
Stephen Green 65
Mary Shane 54
Casey Step 34
Mick Sale
The final dataset should contain every Customer1 and Customer2 from both dataset 1 and dataset 3, with no duplicates.
I have tried to combine them as follows:
result = pd.concat([df2, df1, df3], axis=1)
but the result is not the one I expect. Something is wrong in my way of concatenating the datasets, and I'd appreciate it if you could tell me what.
After concatenating dataframes df1 and df3 (they have the same columns), we can remove the duplicates using df1.drop_duplicates(subset=['Customer1']) and then join with df2 like this:
df1.set_index('Customer1').join(df2.set_index('Customer'))
If df1 and df2 had different columns keyed on the same primary key, we could join with the command above and then join again with the age table.
This gives the result: concatenate dataset 1 and dataset 3 (same columns), then run the join above, specifying the respective keys.
Note: though not strictly part of the question, for the concatenation one can use pd.concat([df1, df3], ignore_index=True) (here we ignore the index column).
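Putting those steps together, here is a runnable sketch built on small frames that mirror the sample data (the frame contents are reconstructed from the question; `merge` is used instead of `set_index(...).join(...)`, which gives the same result for this join):

```python
import pandas as pd

df1 = pd.DataFrame({
    "Customer1": ["Nick McKenzie", "Nick McKenzie", "Johnny Craston",
                  "Johnny Craston", "Molly John"],
    "Customer2": ["Christopher Mill", "Stephen Green", "Mary Shane",
                  "Stephen Green", "Casey Step"],
    "Exposures": [23450, 23450, 12, 12, 1000021],
})
df3 = pd.DataFrame({
    "Customer1": ["Mick Sale", "Mick Sale"],
    "Customer2": ["Johnny Craston", "Stephen Green"],
    "Exposures": [None, None],  # dataset 3 shows no exposure values
})
df2 = pd.DataFrame({
    "Customer": ["Nick McKenzie", "Johnny Craston", "Molly John", "Christopher Mill",
                 "Stephen Green", "Mary Shane", "Casey Step", "Mick Sale"],
    "Age": [53, 75, 34, 63, 65, 54, 34, None],
})

# Stack dataset 1 and dataset 3 (same columns), keep one row per Customer1
exposures = (pd.concat([df1, df3], ignore_index=True)
               .drop_duplicates(subset=["Customer1"]))

# Left-join the exposure info onto the unique-customer table
final = (df2.merge(exposures[["Customer1", "Exposures"]],
                   left_on="Customer", right_on="Customer1", how="left")
             .drop(columns="Customer1"))
```

`final` then matches the desired output: exposures filled in for the Customer1 names, and NaN for customers who only appear as Customer2.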
To whom it may concern,
I have a very large dataframe (MasterDataFrame) that contains ~180K groups that I would like to split into 5 smaller DataFrames and process each smaller DataFrame separately. Does anyone know of any way that I could achieve this split into 5 smaller DataFrames without accidentally splitting/jeopardizing the integrity of any of the groups from the MasterDataFrame? In other words, I would like for the 5 smaller DataFrames to not have overlapping groups.
Thanks in advance,
Christos
This is what my dataset looks like:
|======MasterDataset======|
Name Age Employer
Tom 12 Walmart
Nick 15 Disney
Chris 18 Walmart
Darren 19 KMart
Nate 43 ESPN
Harry 23 Walmart
Uriel 24 KMart
Matt 23 Disney
. . .
. . .
. . .
I need to be able to split my dataset such that the groups shown in the MasterDataset above are preserved. The smaller groups into which my MasterDataset will be split need to look like this:
|======SubDataset1======|
Name Age Employer
Tom 12 Walmart
Chris 18 Walmart
Harry 23 Walmart
Darren 19 KMart
Uriel 24 KMart
|======SubDataset2======|
Name Age Employer
Nick 15 Disney
Matt 23 Disney
I assume that by "groups" you mean the number of lines. For that, .iloc should be perfect.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
df_1 = df.iloc[0:100000, :]
df_2 = df.iloc[100000:200000, :]
....
(Note that the end of an .iloc slice is exclusive, so the second slice starts at 100000, not 100001.)
I'm working on a dataset called gradedata.csv in Python Pandas where I've created a new binned column called 'Status' as 'Pass' if grade > 70 and 'Fail' if grade <= 70. Here is the listing of first five rows of the dataset:
fname lname gender age exercise hours grade \
0 Marcia Pugh female 17 3 10 82.4
1 Kadeem Morrison male 18 4 4 78.2
2 Nash Powell male 18 5 9 79.3
3 Noelani Wagner female 14 2 7 83.2
4 Noelani Cherry female 18 4 15 87.4
address status
0 9253 Richardson Road, Matawan, NJ 07747 Pass
1 33 Spring Dr., Taunton, MA 02780 Pass
2 41 Hill Avenue, Mentor, OH 44060 Pass
3 8839 Marshall St., Miami, FL 33125 Pass
4 8304 Charles Rd., Lewis Center, OH 43035 Pass
Now, how do I compute the mean hours of exercise of female students with a status of 'Pass'?
I've used the code below, but it isn't working:
print(df.groupby('gender', 'status')['exercise'].mean())
I'm new to Pandas. Could anyone please help me solve this?
You are very close. Note that your groupby key must be one of mapping, function, label, or list of labels. In this case, you want a list of labels. For example:
res = df.groupby(['gender', 'status'])['exercise'].mean()
You can then extract your desired result via pd.Series.get:
query = res.get(('female', 'Pass'))
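A self-contained run of those two lines on toy data shaped like the question's sample (values here are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["female", "male", "male", "female", "female"],
    "status": ["Pass", "Pass", "Fail", "Pass", "Fail"],
    "exercise": [3, 4, 5, 2, 4],
})

# Group by a LIST of labels, then pull out one (gender, status) cell
res = df.groupby(["gender", "status"])["exercise"].mean()
mean_female_pass = res.get(("female", "Pass"))  # mean of 3 and 2 -> 2.5
```

`res` is a Series with a (gender, status) MultiIndex, which is why `.get` takes a tuple.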
I am working on a name matching problem using synthetic data such as
alertname custname
0 wlison wilson
1 dais said
2 4dams adams
3 ad4ms adams
4 ad48s adams
5 smyth smith
6 smythe smith
7 gillan gillan
8 gilen gillan
9 scott-smith scottsmith
10 scott smith scottsmith
11 perrson person
12 persson person
Now I want to apply kNN to this task in an unsupervised way, since I do not have any explicit labels. I want to output a matching score for each row. I have already used fuzzy matching; now I just want to explore kNN for some automation. I would really appreciate a starting point. As noted, we have no external labels here.
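As a starting point (a sketch, not a definitive pipeline): vectorize the names with character n-gram TF-IDF and use scikit-learn's NearestNeighbors, which is the unsupervised nearest-neighbour counterpart of kNN, to retrieve the closest custname for each alertname; cosine similarity serves as the matching score. The name lists below are a subset of the sample data:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

alertnames = ["wlison", "dais", "4dams", "smyth"]
custnames = ["wilson", "said", "adams", "smith", "gillan", "person", "scottsmith"]

# Character n-grams are robust to transpositions and digit substitutions
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X = vec.fit_transform(custnames)

# One nearest neighbour per alert name, cosine distance
nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(X)
dist, idx = nn.kneighbors(vec.transform(alertnames))

matches = pd.DataFrame({
    "alertname": alertnames,
    "best_match": [custnames[i] for i in idx[:, 0]],
    "score": 1 - dist[:, 0],  # cosine similarity as a matching score
})
```

Raising `n_neighbors` gives a ranked candidate list per alertname, and thresholding the score lets you flag low-confidence matches for review.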