I have a data set of flight data, and I am trying to predict whether a flight will be delayed or not. However, I am getting stuck because I have two non-numeric columns: one is the Destination column, which is a city code, and the other is the airline code.
There are 155 different destinations, and I don't want to add 154 columns to make it binary.
Is it possible to have the models ignore the columns without deleting them?
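A minimal pandas/scikit-learn-style sketch of that idea (the column names Destination, Airline and Delayed, and the file name, are assumptions) is to keep the columns in the dataframe but pass only the chosen features to the model:

import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("flights.csv")   # hypothetical file

# the two code columns stay in df; they are simply not handed to the model
feature_cols = [c for c in df.columns if c not in ("Destination", "Airline", "Delayed")]
X = df[feature_cols]
y = df["Delayed"]

model = LogisticRegression(max_iter=1000).fit(X, y)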
I have a dataframe in PySpark (on Databricks) with the following schema structure:
orders schema:
submitted_at:timestamp
submitted_yyyy_mm:string (format "yyyy-MM")
order_id:string
customer_id:string
sales_rep_id:string
shipping_address_attention:string
shipping_address_address:string
shipping_address_city:string
shipping_address_state:string
shipping_address_zip:integer
ingest_file_name:string
ingested_at:timestamp
I need to capture the data in my table in Delta Lake format, with a partition for every month of the order history reflected in the submitted_yyyy_mm column. I am capturing the data correctly, with the exception of two problems. One, my technique adds two columns (and the corresponding data) to the schema (I could not figure out how to do the partitioning without adding columns). Two, the partitions correctly capture all the year/months with data, but are missing the year/months without data (the requirement is that those need to be included as well). Specifically, every month of 2017-2019 should have its own partition (so 36 months). However, my technique only created partitions for the months that actually had orders (which turned out to be 18 of the 36 months).
Here is the relevant part of my code:
from pyspark.sql import functions as F
from pyspark.sql import types as T

# take the pristine order table and add these two extra columns you should not have in order to get the partition structure
df_with_year_and_month = (df_orders
    .withColumn("year", F.year(F.col("submitted_yyyy_mm").cast(T.TimestampType())))
    .withColumn("month", F.month(F.col("submitted_yyyy_mm").cast(T.TimestampType()))))

# capture the data to the orders table using the year/month partitioning
df_with_year_and_month.write.partitionBy("year", "month").mode("overwrite").format("delta").saveAsTable(orders_table)
I would be grateful to anyone who might be able to help me tweak my code to fix the two issues I have with the result. Thank you.
There's no issue here. That's just how it works.
You want to partition on year and month, so you need to have those values in your data; there is no way around it. You should also only partition on columns you actually filter on, because that enables partition pruning and results in faster queries. It would make no sense to partition on a field whose values aren't in the data.
Also, it's totally normal that partitions aren't created for months that have no data. Once data is added, the corresponding partition is created if it doesn't exist yet. You don't need it any sooner than that.
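If the goal is one partition per month without adding extra columns, a minimal sketch along the same lines (dataframe and table names taken from the question) is to partition directly on the existing submitted_yyyy_mm column:

df_orders.write \
    .partitionBy("submitted_yyyy_mm") \
    .mode("overwrite") \
    .format("delta") \
    .saveAsTable(orders_table)

As noted above, partitions for empty months would still only appear once rows for those months are written.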
My supervisor wants a single table comparing multiple different categorical variables against another categorical variable. For example, the attached image (an x-tab cross table, found here: https://strengejacke.wordpress.com/2014/02/20/no-need-for-spss-beautiful-output-in-r-rstats/) was made with R's sjt.xtab() [though the function name has since changed].
I could use sjt.xtab() to create another cross-table with different index variables, for example age category (0-15, 16-29, etc.) against the same column variable (dependency level). What I need is to combine both of these cross-tables into one table where the column categories stay in the same position and several different categorical variables (sex, age categories, shoe size, etc.) are listed in the index. This doesn't seem statistically correct, as it would appear to duplicate numbers, but my supervisor just wants it for reference, not for publication.
Is there any way to do this in R or python? Happy to clarify my question if needed!
Edit: Here is a terribly edited Microsoft Paint example of what I am looking for: Combined cross-tab image.
You may do that with GT tables: https://gt.rstudio.com/articles/intro-creating-gt-tables.html
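Since the question also asks about Python, here is a rough pandas sketch of the same idea (not the gt approach above; the column names sex, age_group and dependency are made up): build one cross-tab per index variable against the same column variable, then stack them so the column categories stay in the same position.

import pandas as pd

df = pd.DataFrame({
    "sex": ["m", "f", "f", "m", "f"],
    "age_group": ["0-15", "16-29", "16-29", "0-15", "16-29"],
    "dependency": ["low", "high", "low", "high", "low"],
})

# one cross-tab per index variable, all sharing the same column variable
tables = {var: pd.crosstab(df[var], df["dependency"]) for var in ["sex", "age_group"]}

# stacking keeps the dependency columns aligned across every block
combined = pd.concat(tables, names=["variable", "category"])
print(combined)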
Currently building a simple Customer Lifetime Value calculator program for marketing purposes. For a portion of the program, I give the user the option to import a CSV file via pd.read_csv to allow calculations across multiple customer records. I designate the required order of the CSV data in notes included in the output window.
The imported CSV should have 4 inputs per row. Building off of this, I want to create a new column in the dataframe that multiplies columns 1-4. Operating under the assumption that some users will include headers (that will vary per user) while others will not, is there a way I can create the new calculated column based on column # rather than header?
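A minimal sketch of position-based access with iloc (the file name customers.csv and the new column name clv are placeholders; detecting whether a header row is present is a separate question):

import pandas as pd

df = pd.read_csv("customers.csv")

# iloc selects by position, so this works regardless of what the headers are called
df["clv"] = df.iloc[:, 0:4].prod(axis=1)   # product of the first four columns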
Beginner here. None of the answers I have found have worked for me/been similar to my situation.
For instance, column x has 50 values and all of these values are the same.
Is it a good idea to delete variables like these for building machine learning models? If so, how can I spot these variables in a large data set?
I guess a formula/function might be required to do so. I am thinking of using nunique, which can take the whole dataset into account.
You should delete such columns because they provide no information about how one data point differs from another. It's fine to leave the column in for some machine learning models (due to the nature of how the algorithms work), like random forest, because such a column will simply never be selected to split the data.
To spot these, especially for categorical or nominal variables (with a fixed number of possible values), count the occurrences of each unique value; if the most frequent value accounts for more than a certain threshold (say 95%), delete that column from your model.
I personally go through the variables one by one if there aren't many, so that I can fully understand each variable in the model, but the systematic approach above is the way to go if the feature set is too large.
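A minimal sketch of that check (the function name and the 95% threshold are just for illustration):

import pandas as pd

def near_constant_columns(df: pd.DataFrame, threshold: float = 0.95) -> list:
    flagged = []
    for col in df.columns:
        share_of_mode = df[col].value_counts(normalize=True, dropna=False)
        # a single unique value, or one value dominating beyond the threshold
        if df[col].nunique(dropna=False) <= 1 or share_of_mode.iloc[0] >= threshold:
            flagged.append(col)
    return flagged

# usage: df = df.drop(columns=near_constant_columns(df))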
I have used Python Faker to generate fake data. But I need to know the maximum number of distinct fake values (e.g. fake names) that can be generated with Faker (e.g. fake.name()).
I have generated 100,000 fake names and got fewer than 76,000 distinct names. I need to know the maximum limit so that I know how far I can scale this package for generating data.
I need to generate a huge dataset. I also want to know whether the PHP Faker and Perl Faker ports behave the same across different environments.
Suggestions for other packages that can generate huge datasets would be highly appreciated.
I had this same issue and looked more into it.
In the en_US provider there are about 1,000 last names and 750 first names, for about 750,000 unique combinations. If you randomly select a first and last name, there is a chance you'll get duplicates. But in reality that's how the real world works; there are many John Smiths and Robert Doyles out there.
There are 7,203 first names and 473 last names in the en profile, which can help somewhat. Faker chooses a combination of first name and last name, meaning there are about 7203 * 473 = 3,407,019 possibilities.
But still, there is a chance you'll get duplicates.
I solve this problem by adding numbers to names.
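A minimal sketch of that workaround (appending a running number so every generated value is distinct, even when Faker repeats a name):

from faker import Faker

fake = Faker("en_US")

# the counter guarantees uniqueness regardless of how often Faker repeats itself
unique_names = [f"{fake.name()} {i}" for i in range(100_000)]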
"I need to generate huge dataset."
Keep in mind that in reality, any huge dataset of names will have duplicates. I work with large datasets (> 1 million names) and we see a ton of duplicate first and last names.
If you read the faker package code, you can probably figure out how to modify it so you get all 3M distinct names.