Read a CSV File and create new csv and columns - python

Mathew Jim 60
Gerry Hagger 61
Sam Page 23
Azli Muzan 52
David Agor 32
James Paine 40
Mike Gregor 63
Howard Jack 56
I have this data in a CSV file, with a name and a number in one column, and would like a Python script that separates the names from the numbers and creates a new CSV with the heights of the individuals classified as short, medium, or tall.
Height  Classification
<5      short
5-6     medium
>6      tall
I would like two column headings: Names and Classification.


Combining three datasets removing duplicates

I've three datasets:
dataset 1
Customer1 Customer2 Exposures + other columns
Nick McKenzie Christopher Mill 23450
Nick McKenzie Stephen Green 23450
Johnny Craston Mary Shane 12
Johnny Craston Stephen Green 12
Molly John Casey Step 1000021
dataset2 (unique Customers: Customer 1 + Customer 2)
Customer Age
Nick McKenzie 53
Johnny Craston 75
Molly John 34
Christopher Mill 63
Stephen Green 65
Mary Shane 54
Casey Step 34
Mick Sale
dataset 3
Customer1 Customer2 Exposures + other columns
Mick Sale Johnny Craston
Mick Sale Stephen Green
Exposures refers to Customer 1 only.
There are other columns, omitted for brevity. Dataset 2 is built by taking the unique values of Customer 1 and Customer 2: there are no duplicates in that dataset. Dataset 3 has the same columns as dataset 1.
I'd like to add the information from dataset 1 into dataset 2 to have
Final dataset
Customer Age Exposures + other columns
Nick McKenzie 53 23450
Johnny Craston 75 12
Molly John 34 1000021
Christopher Mill 63
Stephen Green 65
Mary Shane 54
Casey Step 34
Mick Sale
The final dataset should have all Customer1 and Customer 2 from both dataset 1 and dataset 3, with no duplicates.
I have tried to combine them as follows
result = pd.concat([df2,df1,df3], axis=1)
but the result is not the one I'd expect.
Something is wrong in my way of concatenating the datasets, and I'd appreciate it if you could let me know what it is.
After concatenating the dataframes df1 and df3 (they have the same columns), you can remove the duplicates using df1.drop_duplicates(subset=['Customer1']) and then join with df2 like this:
df1.set_index('Customer1').join(df2.set_index('Customer'))
If df1 and df2 have different columns for the primary key, you can join using the command above and then join with the age table again.
This gives the result: concatenate dataset 1 and dataset 3 because they have the same columns, then run the join, specifying the respective keys.
Note: not strictly part of the question, but for the concatenation you can use pd.concat([df1, df3], ignore_index=True) (ignore_index rebuilds the row index instead of keeping the originals).
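Putting the steps together on toy frames mirroring the question's data (a sketch; the extra columns are trimmed for brevity, and the missing Exposures/Age values are represented as None):

```python
import pandas as pd

df1 = pd.DataFrame({
    "Customer1": ["Nick McKenzie", "Nick McKenzie", "Johnny Craston",
                  "Johnny Craston", "Molly John"],
    "Customer2": ["Christopher Mill", "Stephen Green", "Mary Shane",
                  "Stephen Green", "Casey Step"],
    "Exposures": [23450, 23450, 12, 12, 1000021],
})
df3 = pd.DataFrame({
    "Customer1": ["Mick Sale", "Mick Sale"],
    "Customer2": ["Johnny Craston", "Stephen Green"],
    "Exposures": [None, None],
})
df2 = pd.DataFrame({
    "Customer": ["Nick McKenzie", "Johnny Craston", "Molly John",
                 "Christopher Mill", "Stephen Green", "Mary Shane",
                 "Casey Step", "Mick Sale"],
    "Age": [53, 75, 34, 63, 65, 54, 34, None],
})

# datasets 1 and 3 have the same columns, so stack them
combined = pd.concat([df1, df3], ignore_index=True)

# Exposures refer to Customer1 only: keep one row per Customer1
exposures = (combined.drop_duplicates(subset=["Customer1"])
                     .drop(columns=["Customer2"])
                     .rename(columns={"Customer1": "Customer"}))

# left-join onto the unique-customer table so every customer survives
final = df2.merge(exposures, on="Customer", how="left")
```

Customers that never appear in the Customer1 column (Christopher Mill, Stephen Green, ...) simply get a missing Exposures value, matching the expected final dataset.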

Splitting DataFrame and maintaining DataFrame group integrity

To whom it may concern,
I have a very large dataframe (MasterDataFrame) that contains ~180K groups that I would like to split into 5 smaller DataFrames and process each smaller DataFrame separately. Does anyone know of any way that I could achieve this split into 5 smaller DataFrames without accidentally splitting/jeopardizing the integrity of any of the groups from the MasterDataFrame? In other words, I would like for the 5 smaller DataFrames to not have overlapping groups.
Thanks in advance,
Christos
This is what my dataset looks like:
|======MasterDataset======|
Name Age Employer
Tom 12 Walmart
Nick 15 Disney
Chris 18 Walmart
Darren 19 KMart
Nate 43 ESPN
Harry 23 Walmart
Uriel 24 KMart
Matt 23 Disney
. . .
. . .
. . .
I need to be able to split my dataset such that the groups shown in the MasterDataset above are preserved. The smaller groups into which my MasterDataset will be split need to look like this:
|======SubDataset1======|
Name Age Employer
Tom 12 Walmart
Chris 18 Walmart
Harry 23 Walmart
Darren 19 KMart
Uriel 24 KMart
|======SubDataset2======|
Name Age Employer
Nick 15 Disney
Matt 23 Disney
I assume that by "groups" you mean blocks of rows.
For that, .iloc should be perfect.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
df_1 = df.iloc[0:100000, :]       # rows 0-99999 (the end of the slice is exclusive)
df_2 = df.iloc[100000:200000, :]  # rows 100000-199999
....
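Note that slicing by row count can cut a group in half at a chunk boundary. If the groups must stay intact, one alternative is to split the group keys rather than the rows. A sketch, assuming the group column is Employer as in the MasterDataset above (np.array_split divides the keys into roughly equal chunks):

```python
import numpy as np
import pandas as pd

# toy stand-in for the MasterDataFrame; 'Employer' is the group key
df = pd.DataFrame({
    "Name": ["Tom", "Nick", "Chris", "Darren", "Nate", "Harry", "Uriel", "Matt"],
    "Age": [12, 15, 18, 19, 43, 23, 24, 23],
    "Employer": ["Walmart", "Disney", "Walmart", "KMart",
                 "ESPN", "Walmart", "KMart", "Disney"],
})

n_parts = 2  # the question wants 5; 2 keeps the toy example readable

# split the *group keys*, not the rows, so no group straddles two parts
keys = df["Employer"].unique()
key_chunks = np.array_split(keys, n_parts)
parts = [df[df["Employer"].isin(chunk)] for chunk in key_chunks]
```

The parts may differ in row count when groups are unevenly sized, but every group lands in exactly one sub-DataFrame.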

dataframe one row save to excel sheet

I'd like to ask a simple question. I want to append the rows of a dataframe after the last row of an Excel sheet, without the dataframe's column names.
Dataframe:
date Name Age
2019/10/1 Kate 18
2019/10/2 Jim 20
2019/10/3 James 23
excel sheet:
date Name Age
2019/9/29 Rose 18
2019/9/30 Eva 20
I want to add the dataframe values after the Excel sheet's last row, something like this:
excel sheet new:
date Name Age
2019/9/29 Rose 18
2019/9/30 Eva 20
2019/10/1 Kate 18
2019/10/2 Jim 20
2019/10/3 James 23
code:
import xlwings as xw

app = xw.App(visible=False, add_book=False)
wb = app.books.open(path_xl)
sht = wb.sheets[sht_name]
rng = sht.range("A1")
last_rows = rng.current_region.rows.count  # rows already used, header included
sht.range('A' + str(last_rows + 1)).value = df.values
but the result saved in the Excel sheet is wrong, and I don't know why.
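It's hard to diagnose without seeing the actual output, but two common pitfalls with this pattern are (a) the workbook is never saved before the app quits, so the appended rows are lost, and (b) handing xlwings a NumPy object array instead of plain Python lists. A sketch of the write path (the Excel-side lines are commented out so the snippet runs without an Excel installation; path_xl and sht_name are the question's own placeholders):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2019/10/1", "2019/10/2", "2019/10/3"],
    "Name": ["Kate", "Jim", "James"],
    "Age": [18, 20, 23],
})

# plain Python lists are the safest payload for an xlwings range write
rows = df.values.tolist()

# the xlwings part, commented out so this sketch runs without Excel:
# import xlwings as xw
# app = xw.App(visible=False, add_book=False)
# wb = app.books.open(path_xl)
# sht = wb.sheets[sht_name]
# last_rows = sht.range("A1").current_region.rows.count
# sht.range("A" + str(last_rows + 1)).value = rows
# wb.save()   # without this, the appended rows are lost
# wb.close()
# app.quit()
```

If only the final dataframe row should be written, use df.tail(1).values.tolist() instead of the full df.values.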

How to avoid iloc or hard-coded index numbers in pandas when dynamically fetching rows from a single dataframe into multiple subsets?

My dataframe looks likes this
country1 state1 city1 District1
india 36 20 40
china 27 21 35
honkong 34 21 38
london 32 21 38
company technology car brand population
adf java Ford 40
ydfh java Hyundai 19
klyu java Nissan 47
hy6g dotnet Toyota 20
rghtr dotnet Hyundai 30
htryr dotnet hummer 12
I want to create multiple subsets from a single dataframe. I do not want to use index numbers, the iloc function, or hard-coded indices, because the positions will shift whenever there is a new entry, either after the entry "london" or after the last entry.
If a new entry comes in, it should also be captured. Any clues how to do this in pandas or numpy?
I hope this question is clear.
Assuming your data frame is saved as df, you can use groupby and save the grouped sub-data to a dictionary for future reference.
d = {}
for group, frame in df.groupby('country1'):
    d[group] = frame
Also, if you want to group by multiple columns, pass a list to groupby as follows:
for group, frame in df.groupby(['country1', 'technology']):
    d[group] = frame
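The loop above, run end to end on a toy frame (column names taken from the question). Because the subsets are keyed by group label rather than by row position, a new row for an existing or new country is picked up automatically the next time the grouping runs:

```python
import pandas as pd

df = pd.DataFrame({
    "country1": ["india", "china", "india", "china"],
    "technology": ["java", "java", "dotnet", "dotnet"],
    "population": [40, 19, 47, 20],
})

# one sub-DataFrame per country, keyed by the group label
d = {group: frame for group, frame in df.groupby("country1")}
```

d["india"] and d["china"] are ordinary DataFrames and can be processed independently.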

Text Analytics/ Text prediction approach in python

This is my customer dataset
ID  Address                                             Location  Age
1   room no5, rose apt, hill street, nagpur-"500249"    Nagpur    38
2   block C-4, kirti park, Thane-"400356"               Thane     26
3   Dream villa, noor nagar, Colaba-"400008"                      46
5   Sita Apt, Room no 101, Kalyan-"421509"              Kalyan    55
7   Rose Apt, room no 20, Antop hill, Mumbai-"400007"   Mumbai    43
8   Dirk Villa, nouis park, Dadar-"400079"              Mumbai    50
9   Raj apt, room no-2, powai, "400076"                           33
As in the above case, I have Address column values and corresponding values for the Location column. Given training data (rows where both the Address and Location values are present) and test data (rows consisting only of Address values, for which I want to predict Location), I want to forecast which address falls under which location.
For this problem I want to know which approach in Python would suit best. Here are some references I came across, but I do not know which to follow:
link_1
link_2
Any suggestion will be much appreciated. Thanks.
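One lightweight baseline worth trying before a full classifier: in this data the location name usually appears somewhere inside the address string, so a simple substring match against the known locations already covers many rows. A minimal sketch (the function name and matching rule are my own; for addresses where no known location name appears, a trained text classifier, e.g. TF-IDF features plus Naive Bayes as discussed in the linked references, would be the next step):

```python
def predict_location(address, known_locations):
    """Return the first known location whose name appears in the address."""
    addr = address.lower()
    for loc in known_locations:
        if loc.lower() in addr:
            return loc
    return None  # no match: fall back to a trained classifier here

# locations taken from the training rows above
known = ["Nagpur", "Thane", "Kalyan", "Mumbai"]
```

This baseline is also useful for sanity-checking a classifier: rows where the two approaches disagree are good candidates for manual review.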
