I am practicing pandas and have run into a problem with an exercise.
I have an Excel file with a column that stores two types of URLs.
df = pd.DataFrame({'id': ['', '', '', ''],
                   'url': ['www.something/12312', 'www.something/12343', 'www.somethingelse/42312', 'www.somethingelse/62343']})
| id | url |
| -------- | -------------- |
| | 'www.something/12312' |
| | 'www.something/12343' |
| | 'www.somethingelse/42312' |
| | 'www.somethingelse/62343' |
I am supposed to transform the URLs into IDs, but only the number should be part of the ID. The new id column should look like this:
df = pd.DataFrame({'id': ['id_12312', 'id_12343', 'diffid_42312', 'diffid_62343'], 'url': ['www.something/12312', 'www.something/12343', 'www.somethingelse/42312', 'www.somethingelse/62343']})
| id | url |
| -------- | -------------- |
| id_12312 | 'www.something/12312' |
| id_12343 | 'www.something/12343' |
| diffid_42312 | 'www.somethingelse/42312' |
| diffid_62343 | 'www.somethingelse/62343' |
My problem is how to extract only the numbers and build that kind of id from them.
I have tried the replace and extract functions in pandas:
id_replaced = df.replace(regex={re.search('something', df['url']): 'id_' + str(re.search(r'\d+', i).group()), re.search('somethingelse', df['url']): 'diffid_' + str(re.search(r'\d+', i).group())})
df['id'] = df['url'].str.extract(re.search(r'\d+', df['url']).group())
However, both throw TypeError: expected string or bytes-like object.
Here is one solution, but I don't quite understand when to use the id prefix and when to use diffid:
>>> df.id = 'id_'+df.url.str.split('/', n=1, expand=True)[1]
>>> df
id url
0 id_12312 www.something/12312
1 id_12343 www.something/12343
2 id_42312 www.somethingelse/42312
3 id_62343 www.somethingelse/62343
Or using str.extract:
>>> df.id = 'id_' + df.url.str.extract(r'/(\d+)$', expand=False)
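To get the diffid_ prefix as well, one possible sketch (assuming the prefix simply depends on whether the URL contains "somethingelse") maps a boolean mask to the two prefixes and concatenates it with the extracted digits:

# choose the prefix per row, then append the trailing digits from the URL
prefix = df['url'].str.contains('somethingelse').map({True: 'diffid_', False: 'id_'})
digits = df['url'].str.extract(r'/(\d+)$', expand=False)
df['id'] = prefix + digits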
I have a pandas df with a column that has a mix of values, like so:
| ID | home_page |
| ---------| ------------------------------------------------|
| 1 | facebook.com, facebook.com, meta.com |
| 2 | amazon.com |
| 3 | twitter.com, dev.twitter.com, twitter.com |
I want to create a new column that contains the unique values from the home_page column. The final output should be:
| ID | home_page | unique |
| -------- | -------------- |---------------------------|
| 1 | facebook.com, facebook.com, meta.com | facebook.com,meta.com |
| 2 | amazon.com | amazon.com |
| 3 | twitter.com, dev.twitter.com, twitter.com |twitter.com,dev.twitter.com|
I tried the following:
final["home_page"] = final["home_page"].str.split(',').apply(lambda x : ','.join(set(x)))
But when I do that I get
TypeError: float object is not iterable
The column has no NaN, but just in case I tried:
final["home_page"] = final["home_page"].str.split(',').apply(lambda x : ','.join(set(x)))
But the entire column returns empty when doing that.
You are right that this is coming from np.nan values which are of type float. The issue happens here: set(np.nan). The following should work for you (and should be faster).
df["home_page"].str.replace(' ', '').str.split(',').apply(np.unique)
If you actually want a string at the end you can throw the following at the end:
.apply(lambda x: ','.join(str(i) for i in x))
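If the order of first appearance matters (np.unique sorts its result, so dev.twitter.com would end up before twitter.com), dict.fromkeys is a possible order-preserving alternative; a sketch assuming home_page holds comma-separated strings:

# fillna('') guards against missing values (floats) before the .str operations;
# dict.fromkeys de-duplicates while keeping first-seen order
df["unique"] = (
    df["home_page"]
    .fillna("")
    .str.replace(" ", "", regex=False)
    .str.split(",")
    .apply(lambda parts: ",".join(dict.fromkeys(parts)))
)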
I have a CSV file with data that looks like the below when opened in Notepad:
| Column A | Column B | Column C | Column D |
| -------- | -------- | -------- | -------- |
| "100053 | \"253\" | \"Apple\" | \"2020-01-01\" |
| "100056 | \"254\" | \"Apple\" | \"2020-01-01\" |
| "100063 | \"255\" | \"Choco\" | \"2020-01-01\" |
I tried this:
df = pd.read_csv("file_name.csv", sep='\t', low_memory=False)
But the output I'm getting is
| Column A | Column B | Column C | Column D |
| -------- | -------- | -------- | -------- |
| 100053\t\253\" | \"Apple\" | \"2020-01-01\" | |
| 100056\t\254\" | \"Apple\" | \"2020-01-01\" | |
| 100063\t\255\" | \"Choco\" | \"2020-01-01\" | |
How can I get the output in the proper format, in the respective columns, with all the extra characters removed?
I have tried different variations of delimiter and escapechar, but no luck. Maybe I'm missing something?
Edit: I figured out how to get rid of the extra characters:
df["ColumnB"] = df["ColumnB"].map(lambda x: str(x)[2:-2])
The above gets rid of the leading two characters and the trailing two characters.
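A possibly more robust variant of that clean-up (a sketch, assuming the stray characters are only leading and trailing backslashes and double quotes) strips them from every text column instead of slicing a fixed number of characters:

# strip leading/trailing backslashes and double quotes from every object (string) column
for c in df.select_dtypes(include="object").columns:
    df[c] = df[c].str.strip('\\"')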
I have a dataframe which looks like this:
| path | content |
| -------- | -------- |
| /root/path/main_folder1/folder1/path1.txt | Val 1 |
| /root/path/main_folder1/folder2/path2.txt | Val 1 |
| /root/path/main_folder1/folder2/path3.txt | Val 1 |
I want to split the column values in path by "/" and keep only the values up to /root/path/main_folder1.
The output that I want is:
| path | content | root_path |
| -------- | -------- | -------- |
| /root/path/main_folder1/folder1/path1.txt | Val 1 | /root/path/main_folder1 |
| /root/path/main_folder1/folder2/path2.txt | Val 1 | /root/path/main_folder1 |
| /root/path/main_folder1/folder2/path3.txt | Val 1 | /root/path/main_folder1 |
I know that I have to use withColumn with split and regexp_extract, but I am not quite getting how to limit the output of regexp_extract.
What do I have to do to get the desired output?
You can use a regular expression to extract the first three directory levels.
df.withColumn("root_path", F.regexp_extract(F.col("path"), "^((/\w*){3})",1))\
.show(truncate=False)
Output:
+-----------------------------------------+-------+-----------------------+
|path |content|root_path |
+-----------------------------------------+-------+-----------------------+
|/root/path/main_folder1/folder1/path1.txt|val 1 |/root/path/main_folder1|
|/root/path/main_folder1/folder2/path2.txt|val 2 |/root/path/main_folder1|
|/root/path/main_folder1/folder2/path3.txt|val 3 |/root/path/main_folder1|
+-----------------------------------------+-------+-----------------------+
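If you prefer to avoid the regular expression, an equivalent sketch (assuming the pyspark.sql.functions alias F and that the prefix is always the first three path levels) splits on "/" and rejoins the leading segments:

# split into segments; slice keeps the leading empty string plus the first three levels,
# and concat_ws rebuilds the prefix with "/" as the separator
df.withColumn(
    "root_path",
    F.concat_ws("/", F.slice(F.split(F.col("path"), "/"), 1, 4))
).show(truncate=False)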
I have a very large dataframe in pyspark. It has over 10 million rows and over 30 columns.
What is the most efficient way to search the entire dataframe for a given list of values and remove the rows that contain any of them?
The given list of values:
list=['1097192','10727550','1098754']
The dataframe (df) is:
+---------+--------------+---------------+---------+------------+
| id | first_name | last_name | Salary | Verifycode |
+---------+--------------+---------------+---------+------------+
| 1986 | Rollie | Lewin | 1097192 | 42254172 | -Remove Row
| 289743 | Karil | Sudron | 2785190 | 3703538 |
| 3864 | Massimiliano | Dallicott | 1194553 | 23292573 |
| 49074 | Gerry | Grinnov | 1506584 | 62291161 |
| 5087654 | Nat | Leatherborrow | 1781870 | 55183252 |
| 689 | Thaine | Tipple | 2150105 | 40583249 |
| 7907 | Myrlene | Croley | 2883250 | 70380540 |
| 887 | Nada | Redier | 2676139 | 10727550 | -Remove Row
| 96533 | Sonny | Bosden | 1050067 | 13110714 |
| 1098754 | Dennie | McGahy | 1804487 | 927935 | -Remove Row
+---------+--------------+---------------+---------+------------+
If it were a smaller dataframe I could use collect() or toLocalIterator() and then iterate over the rows and remove them based on the list values.
Since it is a very large dataframe, what is the best way to solve this?
I have come up with this solution now but is there a better way:
from pyspark.sql.functions import col

column_names = df.schema.names
for name in column_names:
    df = df.filter(~col(name).isin(list))
You got the correct approach of filtering the dataframe with filter and isin. The isin function is fine as long as the list is small (in the few thousands, not millions). Also make sure that your dataframe is partitioned into at least 3 * the number of CPUs on the executors; having plenty of partitions is a must, otherwise parallelism will suffer.
I am more comfortable with Scala, so please take the concept from the code below. You need to build a single Column object by combining all the columns to be filtered on, then pass that final column to dataframe.filter:
column_names = df.schema.names
colFinal  // initialize with one column, e.g. col("colName").isin(list)
for name in column_names:
    colFinal = colFinal.or(col(name).isin(list))
df = df.filter(!colFinal)  // apply negation of the final column object
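A possible PySpark rendering of the same concept (a sketch; values stands for the list of IDs to drop, and colFinal is seeded from the first column so it starts initialized):

from pyspark.sql.functions import col

values = ['1097192', '10727550', '1098754']
column_names = df.schema.names

# build one boolean Column that is True whenever ANY column holds a value from the list
col_final = col(column_names[0]).isin(values)
for name in column_names[1:]:
    col_final = col_final | col(name).isin(values)

df = df.filter(~col_final)  # keep only rows where no column matched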
I need to use vlookup functionality in pandas.
DataFrame 2: (FEED_NAME has no duplicate rows)
+-----------+--------------------+---------------------+
| FEED_NAME | Name | Source |
+-----------+--------------------+---------------------+
| DMSN | DMSN_YYYYMMDD.txt | Main hub |
| PCSUS | PCSUS_YYYYMMDD.txt | Basement |
| DAMJ | DAMJ_YYYYMMDD.txt | Effiel Tower router |
+-----------+--------------------+---------------------+
DataFrame 1:
+-------------+
| SYSTEM_NAME |
+-------------+
| DMSN |
| PCSUS |
| DAMJ |
| : |
| : |
+-------------+
DataFrame 1 contains a lot more rows. It is actually a column in a much larger table. I need to merge df1 with df2 to make it look like:
+-------------+--------------------+---------------------+
| SYSTEM_NAME | Name | Source |
+-------------+--------------------+---------------------+
| DMSN | DMSN_YYYYMMDD.txt | Main Hub |
| PCSUS | PCSUS_YYYYMMDD.txt | Basement |
| DAMJ | DAMJ_YYYYMMDD.txt | Eiffel Tower router |
| : | | |
| : | | |
+-------------+--------------------+---------------------+
In Excel I would just have done VLOOKUP(,,1,TRUE) and then copied it across all cells.
I have tried various combinations of merge and join, but I keep getting KeyError: 'SYSTEM_NAME'.
Code:
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"})
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()
You missed the inplace=True argument in the line _df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"}), so the _df2 column names haven't changed. Try this instead:
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"}, inplace=True)
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()
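An equivalent fix that avoids mutating the _df2 slice in place is to assign the result of rename, or to skip the rename entirely with left_on/right_on; a sketch using the same df1 and df2:

_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[["FEED_NAME", "Name", "Source"]].rename(columns={"FEED_NAME": "SYSTEM_NAME"})
_df3 = pd.merge(_df1, _df2, how="left", on="SYSTEM_NAME")

# or, without renaming at all:
_df3 = pd.merge(df1[["SYSTEM_NAME"]], df2[["FEED_NAME", "Name", "Source"]],
                how="left", left_on="SYSTEM_NAME", right_on="FEED_NAME")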