I have a CSV file with data which looks like this when viewed in Notepad:
| Column A | Column B | Column C | Column D |
|----------|----------|----------|----------|
| "100053 | \"253\" | \"Apple\" | \"2020-01-01\" |
| "100056 | \"254\" | \"Apple\" | \"2020-01-01\" |
| "100063 | \"255\" | \"Choco\" | \"2020-01-01\" |
I tried this:
df = pd.read_csv("file_name.csv", sep='\t', low_memory=False)
But the output I'm getting is:
| Column A | Column B | Column C | Column D |
|----------|----------|----------|----------|
| 100053\t\253\" | \"Apple\" | \"2020-01-01\" | |
| 100056\t\254\" | \"Apple\" | \"2020-01-01\" | |
| 100063\t\255\" | \"Choco\" | \"2020-01-01\" | |
How can I get the output in the proper format, in the respective columns, with all the extra characters removed?
I have tried different variations of delimiter and escapechar, but no luck. Maybe I'm missing something?
Edit: I figured out how to get rid of the extra characters:
df["ColumnB"]=df["ColumnB"].map(lambda x: str(x)[2:-2])
The above gets rid of the leading 2 characters and the trailing 2 characters.
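The same cleanup can be applied to every text column at once instead of one column at a time. A minimal sketch, assuming the stray characters are only backslashes and double quotes at the edges of each field:
import pandas as pd

df = pd.read_csv("file_name.csv", sep='\t', low_memory=False)
# Assumes the extra characters are only \ and " at the edges of each field:
# strip them from both ends of every string column.
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.strip('\\"')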
I have a pandas df with a column that has a mix of values like so:
| ID | home_page |
| ---------| ------------------------------------------------|
| 1 | facebook.com, facebook.com, meta.com |
| 2 | amazon.com |
| 3 | twitter.com, dev.twitter.com, twitter.com |
I want to create a new column that contains the unique values from the home_page column. The final output should be:
| ID | home_page | unique |
| -------- | -------------- |---------------------------|
| 1 | facebook.com, facebook.com, meta.com | facebook.com,meta.com |
| 2 | amazon.com | amazon.com |
| 3 | twitter.com, dev.twitter.com, twitter.com |twitter.com,dev.twitter.com|
I tried the following:
final["home_page"] = final["home_page"].str.split(',').apply(lambda x : ','.join(set(x)))
But when I do that I get:
TypeError: 'float' object is not iterable
The column has no NaN, but just in case I tried:
final["home_page"] = final["home_page"].str.split(',').apply(lambda x : ','.join(set(x)))
But the entire column returns empty when I do that.
You are right that this is coming from np.nan values, which are of type float. The error happens at set(np.nan): set() tries to iterate its argument, and a float is not iterable. The following should work for you (and should be faster):
import numpy as np

df["home_page"].str.replace(' ', '').str.split(',').apply(np.unique)
If you actually want a string rather than an array, you can chain the following onto the end:
.apply(lambda x: ','.join(str(i) for i in x))
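Note that np.unique sorts the values. If the order of first appearance matters (as in the expected output above), here is a sketch that de-duplicates without sorting and tolerates missing values, using the column and frame names from the question:
final["unique"] = (
    final["home_page"]
    .fillna("")                 # guard against NaN rows
    .str.replace(" ", "")       # drop the spaces after the commas
    .str.split(",")
    .apply(lambda parts: ",".join(dict.fromkeys(p for p in parts if p)))
)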
The data frame has a column 'value' which contains some hidden characters.
When I write the data frame to PostgreSQL I get the error below:
ValueError: A string literal cannot contain NUL (0x00) characters.
I somehow found the cause of the error. Refer to the table below (note the missing value in the value column):
| | datetime | mc | tagname | value | quality |
|-------|--------------------------|----|---------|------------|---------|
| 19229 | 16-12-2021 02:31:29.083 | L | VIN | | 192 |
| 19230 | 16-12-2021 02:35:28.257 | L | VIN | C4A 173026 | 192 |
I checked the length of the string and it was the same 10 characters as in the rows below:
df.value.str.len()
Requirement:
I want to replace that empty area with the text 'miss'. I tried different methods in pandas, but I'm not able to do it:
df['value'] = df['value'].str.replace(r"[\"\',]", '')
df.replace('\'', '', regex=True, inplace=True)
Expected output:
| | datetime | mc | tagname | value | quality |
|-------|--------------------------|----|---------|------------|---------|
| 19229 | 16-12-2021 02:31:29.083 | L | VIN | miss | 192 |
| 19230 | 16-12-2021 02:35:28.257 | L | VIN | C4A 173026 | 192 |
Try this:
df['value'] = df['value'].str.replace(r'[\x00-\x19]', '', regex=True).replace('', 'miss')
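The character class strips NUL (0x00) and the other low ASCII control characters, so a cell that contained only hidden characters is left as an empty string; the chained Series.replace then swaps those empty strings for 'miss'. On recent pandas versions, str.replace needs regex=True for the pattern to be treated as a regular expression.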
I have a dataframe which looks like this:
| path | content |
|------|---------|
| /root/path/main_folder1/folder1/path1.txt | Val 1 |
| /root/path/main_folder1/folder2/path2.txt | Val 1 |
| /root/path/main_folder1/folder2/path3.txt | Val 1 |
I want to split the column values in path by "/" and get the values only up to /root/path/main_folder1.
The output that I want is:
| path | content | root_path |
|------|---------|-----------|
| /root/path/main_folder1/folder1/path1.txt | Val 1 | /root/path/main_folder1 |
| /root/path/main_folder1/folder2/path2.txt | Val 1 | /root/path/main_folder1 |
| /root/path/main_folder1/folder2/path3.txt | Val 1 | /root/path/main_folder1 |
I know that I have to use withColumn, split and regexp_extract, but I am not quite getting how to limit the output of regexp_extract.
What is it that I have to do to get the desired output?
You can use a regular expression to extract the first three directory levels:
from pyspark.sql import functions as F

df.withColumn("root_path", F.regexp_extract(F.col("path"), r"^((/\w*){3})", 1)) \
    .show(truncate=False)
Output:
+-----------------------------------------+-------+-----------------------+
|path |content|root_path |
+-----------------------------------------+-------+-----------------------+
|/root/path/main_folder1/folder1/path1.txt|val 1 |/root/path/main_folder1|
|/root/path/main_folder1/folder2/path2.txt|val 2 |/root/path/main_folder1|
|/root/path/main_folder1/folder2/path3.txt|val 3 |/root/path/main_folder1|
+-----------------------------------------+-------+-----------------------+
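Since the question also mentions split, here is an alternative sketch without a regular expression, splitting on "/" and re-joining the first three levels (this uses F.slice and F.array_join, available in Spark 2.4+):
from pyspark.sql import functions as F

# Split on "/", keep the leading empty element plus the first three path names,
# and join them back with "/" to rebuild "/root/path/main_folder1".
df.withColumn(
    "root_path",
    F.array_join(F.slice(F.split(F.col("path"), "/"), 1, 4), "/")
).show(truncate=False)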
I have 2 dataframes which I need to merge based on a column (Employee code). Please note that the dataframe has about 75 columns, so I am providing a sample dataset to get some suggestions/sample solutions. I am using databricks, and the datasets are read from S3.
Following are my 2 dataframes:
DATAFRAME - 1
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | | | | | | | |
|-----------------------------------------------------------------------------------|
DATAFRAME - 2
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | | | | | C | | | | |
|B10001 | | | | | | | | |T2 |
|A10001 | | | | | | | | B | |
|A10001 | | | C | | | | | | |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
I need to merge the 2 dataframes based on EMP_CODE, basically join dataframe1 with dataframe2 based on emp_code. I am getting duplicate columns when I do a join, and I am looking for some help.
Expected final dataframe:
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | C | | C | | | B | |
|B10001 | | | | | | | | |T2 |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
There is 1 row with emp_code A10001 in dataframe1 and 3 rows in dataframe2. All the data should be merged as one record, without any duplicate columns.
Thanks much
You can use an inner join:
output = df1.join(df2, ['EMP_CODE'], how='inner')
You can also apply distinct at the end to remove duplicate rows:
output = df1.join(df2, ['EMP_CODE'], how='inner').distinct()
You can do that in Scala, if both dataframes have the same columns, with:
val output = df1.union(df2)
First you need to aggregate the individual dataframes.
from pyspark.sql import functions as F
df1 = df1.groupBy('EMP_CODE').agg(F.concat_ws(" ", F.collect_list(df1.COLUMN1)).alias('COLUMN1'))
You have to write this for all columns and for all dataframes.
Then you'll have to use the union function on all dataframes:
df1.union(df2)
and then repeat the same aggregation on that unioned dataframe.
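A variation on the same idea that avoids writing the aggregation out per column by hand: union first, then build the aggregation expression for every non-key column in a comprehension. A sketch, using the column names from the sample data:
from pyspark.sql import functions as F

combined = df1.unionByName(df2)
# One concat_ws/collect_list expression per non-key column, keeping the original names.
agg_exprs = [
    F.concat_ws(" ", F.collect_list(c)).alias(c)
    for c in combined.columns if c != "EMP_CODE"
]
result = combined.groupBy("EMP_CODE").agg(*agg_exprs)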
What you need is a union.
If both dataframes have the same number of columns and the columns that are to be "union-ed" are positionally the same (as in your example), this will work:
output = df1.union(df2).dropDuplicates()
If both dataframes have the same number of columns and the columns that need to be "union-ed" have the same name (as in your example as well), this would be better:
output = df1.unionByName(df2).dropDuplicates()
So I have Excel data like:
+---+--------+----------+----------+----------+----------+---------+
| | A | B | C | D | E | F |
+---+--------+----------+----------+----------+----------+---------+
| 1 | Name | 266 | | | | |
| 2 | A | B | C | D | E | F |
| 3 | 0.1744 | 0.648935 | 0.947621 | 0.121012 | 0.929895 | 0.03959 |
+---+--------+----------+----------+----------+----------+---------+
My main labels are on row 2, but I need to delete the first row. I am using the following pandas code:
import pandas as pd
excel_file = 'Data.xlsx'
c1 = pd.read_excel(excel_file)
How do I make the 2nd row my main label row?
You can use the skiprows parameter to skip the top row. You can also read more about the parameters available for read_excel in the pandas documentation.
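A minimal sketch of both options, using the file name from the question; header=1 is an equivalent alternative that treats the second row (index 1) as the header:
import pandas as pd

excel_file = 'Data.xlsx'

# Option 1: skip the first row entirely; the next row becomes the header.
c1 = pd.read_excel(excel_file, skiprows=1)

# Option 2: read from the top but use the second row (index 1) as the header.
c1 = pd.read_excel(excel_file, header=1)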