I am trying to accomplish something I thought would be easy: take three columns from my dataframe, use a label encoder to encode them, and simply replace the current values with the new ones.
I have a dataframe that looks like this:
| Order_Num | Part_Num | Site | BUILD_ID |
| --------- | -------- | ---- | -------- |
| MO100161015 | PPT-100K39 | BALT | A001 |
| MO100203496 | MDF-925R36 | BALT | A001 |
| MO100203498 | PPT-825R34 | BALT | A001 |
| MO100244071 | MDF-323DCN | BALT | A001 |
| MO100244071 | MDF-888888 | BALT | A005 |
I am essentially trying to use sklearn's LabelEncoder() to convert my string variables to numeric values. Currently, I have a function str_to_num that takes a column and returns an array (column) of the converted data. It works great.
However, I am struggling to remove the old data from my dataframe and add it to the new. My script is below:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import pandas as pd
import numpy as np
# Convert the passed in column
def str_to_num(arr):
    le = preprocessing.LabelEncoder()
    array_of_parts = []
    for x in arr:
        array_of_parts.append(x)
    new_arr = le.fit_transform(array_of_parts)
    return new_arr
# read in data from csv
data = pd.read_csv('test.csv')
print(data)
# Create the new data
converted_column = str_to_num(data['Order_Num'])
print(converted_column)
# How can I replace data['Order_Num'] with the values in converted_column?
# Drop the old data
dropped = data.drop('Order_Num', axis=1)
# Add the new_data column to the place where the old data was?
Given my current script, how can I replace the values in the 'Order_Num' column with those in converted_column? I have tried pandas.DataFrame.replace, but that replaces specific values, and I don't know how to map that to the returned data.
I would expect the resulting data to be:
| Order_Num | Part_Num | Site | BUILD_ID |
| --------- | -------- | ---- | -------- |
| 0 | PPT-100K39 | BALT | A001 |
| 1 | MDF-925R36 | BALT | A001 |
| 2 | PPT-825R34 | BALT | A001 |
| 3 | MDF-323DCN | BALT | A001 |
| 3 | MDF-888888 | BALT | A005 |
My python --version returns
3.6.7
The beauty of pandas is sometimes understated; often you only need to do something like this:
data['Order_Num'] = str_to_num(data['Order_Num'])
There's also the option of df.apply().
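For instance, a minimal sketch that encodes three columns in place (assuming the three you want are Order_Num, Part_Num, and Site; adjust the list as needed):
for column in ['Order_Num', 'Part_Num', 'Site']:
    # A fresh LabelEncoder per column, since each column has its own classes.
    data[column] = preprocessing.LabelEncoder().fit_transform(data[column])
With df.apply, the same subset can be encoded in one line, e.g. data[cols] = data[cols].apply(preprocessing.LabelEncoder().fit_transform), where cols is your list of column names and fit_transform is refitted once per column.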
Related
So I currently have a large CSV containing data for a number of events.
Column one contains a number of dates as well as some IDs for each event.
Basically, I want to write something in Python so that whenever there is an ID number (AL.....), it creates a new CSV titled with that ID number, containing all the data before the next ID number, so I end up with one CSV per event.
For info, the whole CSV contains 8 columns, but the division into individual CSVs is predicated only on column one.
Use Python to split a CSV file with multiple headers
I notice this question is quite similar, but in my case I have AL followed by a different string of numbers each time, and I also want to name the new CSVs after the ID numbers.
You can achieve this using pandas, so let's first generate some data:
import pandas as pd
import numpy as np
def date_string():
    return str(np.random.randint(1, 32)) + "/" + str(np.random.randint(1, 13)) + "/1997"
l = [date_string() for x in range(20)]
l[0] = "AL123"
l[10] = "AL321"
df = pd.DataFrame(l, columns=['idx'])
# -->
| | idx |
|---:|:-----------|
| 0 | AL123 |
| 1 | 24/3/1997 |
| 2 | 8/6/1997 |
| 3 | 6/9/1997 |
| 4 | 31/12/1997 |
| 5 | 11/6/1997 |
| 6 | 2/3/1997 |
| 7 | 31/8/1997 |
| 8 | 21/5/1997 |
| 9 | 30/1/1997 |
| 10 | AL321 |
| 11 | 8/4/1997 |
| 12 | 21/7/1997 |
| 13 | 9/10/1997 |
| 14 | 31/12/1997 |
| 15 | 15/2/1997 |
| 16 | 21/2/1997 |
| 17 | 3/3/1997 |
| 18 | 16/12/1997 |
| 19 | 16/2/1997 |
So, the interesting positions are 0 and 10, as that is where the AL* strings are...
Now, to find the AL* rows, you can use:
idx = df.index[df['idx'].str.startswith('AL')] # gets you every index where the value starts with AL
dfs = np.split(df, idx) # splits the data at those positions
for out in dfs[1:]:
    name = out.iloc[0, 0]
    out.to_csv(name + ".csv", index=False, header=False) # saves each chunk
This gives you two CSV files, named AL123.csv and AL321.csv, with the first line of each being the AL* string.
I am trying to output, in a new column, integer values (labels/classes) based on the labels in another column of my dataset. I previously did this by creating a new column of boolean values for each class and then using those to build the numerical class column, but I was trying to do it with a dictionary instead, which I think is a better and faster way.
If I run a code like this:
x = df['Item_Type'].value_counts()
item_type_mapping = {}
item_list = x.index
for i in range(0, len(item_list)):
    item_type_mapping[item_list[i]] = i
It generates the dictionary, but then if I run:
df['Item_Type'] = df['Item_Type'].map(lambda x: item_type_mapping[x])
or
df['New_column'] = [item_type_mapping[item] for item in data.Item_Type]
It throws KeyError: None.
Does anybody know why this occurs? I find it strange, since the dictionary has been created and I can inspect it in my variables.
Thanks
Edit 1
@Fourier
Simply put, I have this column:
| Item_type|
| -------- |
| Nino |
| Nino |
| Nino |
| Pasquale |
| Franco |
| Franco |
and then I need the same column or a new one to display:
| Item_type| New_column |
| -------- | ---------- |
| Nino | 1 |
| Nino | 1 |
| Nino | 1 |
| Pasquale | 2 |
| Franco | 3 |
| Franco | 3 |
Your code works for me, but what you're trying to do is already provided by pandas as Categorical data.
df = pd.DataFrame({'Item_Type': list('abca')})
df['New_column'] = df.Item_Type.astype('category').cat.codes
Result:
Item_Type New_column
0 a 0
1 b 1
2 c 2
3 a 0
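Note that the codes start at 0 and follow the sorted order of the categories rather than the order of appearance; add 1 if you need the numbering to start at 1.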
I have a very large dataframe in pyspark. It has over 10 million rows and over 30 columns.
What is the most efficient method to search the entire dataframe for a given list of values and remove the rows that contain those values?
The given list of values:
list=['1097192','10727550','1098754']
The dataframe (df) is:
+---------+--------------+---------------+---------+------------+
| id | first_name | last_name | Salary | Verifycode |
+---------+--------------+---------------+---------+------------+
| 1986 | Rollie | Lewin | 1097192 | 42254172 | -Remove Row
| 289743 | Karil | Sudron | 2785190 | 3703538 |
| 3864 | Massimiliano | Dallicott | 1194553 | 23292573 |
| 49074 | Gerry | Grinnov | 1506584 | 62291161 |
| 5087654 | Nat | Leatherborrow | 1781870 | 55183252 |
| 689 | Thaine | Tipple | 2150105 | 40583249 |
| 7907 | Myrlene | Croley | 2883250 | 70380540 |
| 887 | Nada | Redier | 2676139 | 10727550 | -Remove Row
| 96533 | Sonny | Bosden | 1050067 | 13110714 |
| 1098754 | Dennie | McGahy | 1804487 | 927935 | -Remove Row
+---------+--------------+---------------+---------+------------+
If it were a smaller dataframe, I could use the collect() or toLocalIterator() functions and then iterate over the rows and remove them based on the list values.
Since it is a very large dataframe, what is the best way to solve this?
I have come up with this solution, but is there a better way:
column_names = df.schema.names
for name in column_names:
    df = df.filter(~col(name).isin(list))
You got the correct approach of filtering the dataframe using the filter and isin functions. The isin function works well if the list is small (in the few thousands, not millions). Also make sure that your dataframe is partitioned to at least 3 times the number of CPUs on the executors. Having many partitions is a must; without them, parallelism will suffer.
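As a rough sketch (the core count below is a hypothetical placeholder; substitute your cluster's actual total executor cores):
total_executor_cores = 16  # hypothetical placeholder; use your real total
# About 3 partitions per executor core keeps every CPU busy.
df = df.repartition(3 * total_executor_cores)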
I am more comfortable with Scala, so please take the concept from the code below. You need to build a single Column object by combining all the columns to be filtered on, then pass that final Column object to dataframe.filter:
column_names = df.schema.names
colFinal  # initialize with the first column, as col("colName").isin(list)
for name in column_names:
    colFinal = colFinal.or(col(name).isin(list))
df = df.filter(!colFinal)  # apply the negation of the final column object
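For reference, here is a minimal PySpark sketch of the same idea (a sketch assuming a DataFrame df is already loaded; the list is renamed to values to avoid shadowing the Python builtin):
from functools import reduce
from pyspark.sql.functions import col

values = ['1097192', '10727550', '1098754']

# Combine the per-column conditions into one Column object:
# True whenever ANY column's value is in the list.
match_any = reduce(
    lambda acc, name: acc | col(name).isin(values),
    df.columns[1:],
    col(df.columns[0]).isin(values),
)

# Keep only the rows where no column matched.
df = df.filter(~match_any)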
So I have Excel data like:
+---+--------+----------+----------+----------+----------+---------+
| | A | B | C | D | E | F |
+---+--------+----------+----------+----------+----------+---------+
| 1 | Name | 266 | | | | |
| 2 | A | B | C | D | E | F |
| 3 | 0.1744 | 0.648935 | 0.947621 | 0.121012 | 0.929895 | 0.03959 |
+---+--------+----------+----------+----------+----------+---------+
My main labels are on row 2, but I need to delete the first row. I am using the following pandas code:
import pandas as pd
excel_file = 'Data.xlsx'
c1 = pd.read_excel(excel_file)
How do I make the 2nd row as my main label row?
You can use the skiprows parameter to skip the top row; you can also read more about the parameters available for read_excel in the pandas documentation.
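For example, a minimal sketch (skiprows is a documented read_excel parameter):
import pandas as pd
# Skip the first physical row so that row 2 becomes the header row.
c1 = pd.read_excel('Data.xlsx', skiprows=1)
Passing header=1 instead of skiprows=1 achieves the same result by pointing the header at the second row.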
I have a table that needs to be split into multiple files, grouped by the values in column 1, serial.
+--------+--------+-------+
| serial | name | price |
+--------+--------+-------+
| 100-a | rdl | 123 |
| 100-b | gm1 | -120 |
| 100-b | gm1 | 123 |
| 180r | xxom | 12 |
| 182d | data11 | 11.50 |
+--------+--------+-------+
The output would be like this:
100-a.xls
100-b.xls
180r.xls, etc.
and opening 100-b.xls contains this:
+--------+------+-------+
| serial | name | price |
+--------+------+-------+
| 100-b | gm1 | -120 |
| 100-b | gm1 | 123 |
+--------+------+-------+
I tried using pandas to define the dataframe with this code:
import pandas as pd
#from itertools import groupby
df = pd.read_excel('myExcelFile.xlsx')
I was successful in getting the dataframe, but I have no idea what to do next. I tried following this similar question on Stack Overflow, but the scenario is a bit different. What is the next approach to this?
This is not a groupby but a filter.
You need to follow two steps:
1. Generate the data that you need in the Excel file.
2. Save the dataframe as Excel.
Something like this should do the trick:
for x in list(df.serial.unique()):
    df[df.serial == x].to_excel("{}.xlsx".format(x))
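For comparison, an equivalent sketch using groupby iteration, which walks the frame once instead of re-filtering it per unique serial (index=False keeps the row index out of the files):
for serial, group in df.groupby('serial'):
    group.to_excel("{}.xlsx".format(serial), index=False)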