Here is sample 1:
| district_id | date |
| -------- | ----------- |
| 18 | 1995-03-24 |
| 1 | 1993-02-26 |
Sample 2:
| link_id | type |
| -------- | ----------- |
| 9 | gold |
| 19 | classic |
I want to take sample 1's date column and sample 2's type column and output them together as data.csv.
You can concatenate the two columns side by side (axis=1 concatenates column-wise) and then write the result out:
df3 = pd.concat([df1['date'], df2['type']], axis=1)
df3.to_csv('data.csv', index=False)
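Put together as a runnable sketch (the frames below are hypothetical stand-ins built from the two samples):

```python
import pandas as pd

# Stand-ins for sample 1 and sample 2
df1 = pd.DataFrame({"district_id": [18, 1], "date": ["1995-03-24", "1993-02-26"]})
df2 = pd.DataFrame({"link_id": [9, 19], "type": ["gold", "classic"]})

# axis=1 places the two columns side by side; reset_index guards against
# misaligned row labels when the frames come from different sources
df3 = pd.concat(
    [df1["date"].reset_index(drop=True), df2["type"].reset_index(drop=True)],
    axis=1,
)
df3.to_csv("data.csv", index=False)
```

`index=False` keeps the row labels out of the file, so data.csv contains only the two requested columns.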
I have a CSV file whose data looks like this when opened in Notepad:
| Column A | Column B | Column C | Column D |
---------------------------------------------------
| "100053 | \"253\" | \"Apple\"| \"2020-01-01\" |
| "100056 | \"254\" | \"Apple\"| \"2020-01-01\" |
| "100063 | \"255\" | \"Choco\"| \"2020-01-01\" |
I tried this:
df = pd.read_csv("file_name.csv", sep='\t', low_memory=False)
But the output I'm getting is
| Column A | Column B | Column C | Column D |
-------------------------------------------------------------
| 100053\t\253\" | \"Apple\" | \"2020-01-01\"| |
| 100056\t\254\" | \"Apple\" | \"2020-01-01\"| |
| 100063\t\255\" | \"Choco\" | \"2020-01-01\"| |
How can I get the output properly split into the respective columns, with all the extra characters removed?
I have tried different variations of delimiter and escapechar, but no luck. Maybe I'm missing something?
Edit: I figured out how to get rid of the extra characters:
df["ColumnB"] = df["ColumnB"].map(lambda x: str(x)[2:-2])
The above removes the leading two characters and the trailing two characters.
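A more systematic sketch, assuming the file really is tab-separated and the stray characters are literal backslashes and quotes: disable quote handling on read, then strip the debris from every text column at once (the in-memory sample below is hypothetical, mimicking the file shown above):

```python
import csv
import io

import pandas as pd

# Hypothetical in-memory stand-in for the file
raw = (
    'Column A\tColumn B\tColumn C\tColumn D\n'
    '"100053\t\\"253\\"\t\\"Apple\\"\t\\"2020-01-01\\"\n'
    '"100056\t\\"254\\"\t\\"Apple\\"\t\\"2020-01-01\\"\n'
)

# QUOTE_NONE stops pandas from interpreting the stray quotes as CSV quoting,
# so the values land in the right columns first
df = pd.read_csv(io.StringIO(raw), sep='\t', quoting=csv.QUOTE_NONE)

# Strip leading/trailing backslashes and quote characters from every text column
df = df.apply(lambda s: s.str.strip('\\"') if s.dtype == object else s)
```

Unlike slicing fixed character counts per column, this handles values whose junk prefix varies in length.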
I have a dataframe which looks like this:
| path                                       | content |
| ------------------------------------------ | ------- |
| /root/path/main_folder1/folder1/path1.txt  | Val 1   |
| /root/path/main_folder1/folder2/path2.txt  | Val 1   |
| /root/path/main_folder1/folder2/path3.txt  | Val 1   |
I want to split the values in the path column on "/" and keep only the part up to /root/path/main_folder1.
The Output that I want is
| path                                       | content | root_path               |
| ------------------------------------------ | ------- | ----------------------- |
| /root/path/main_folder1/folder1/path1.txt  | Val 1   | /root/path/main_folder1 |
| /root/path/main_folder1/folder2/path2.txt  | Val 1   | /root/path/main_folder1 |
| /root/path/main_folder1/folder2/path3.txt  | Val 1   | /root/path/main_folder1 |
I know that I have to use withColumn with split or regexp_extract, but I am not quite getting how to limit the output of regexp_extract.
What do I have to do to get the desired output?
You can use a regular expression to extract the first three directory levels (note the raw string for the pattern and the group index of 1):
from pyspark.sql import functions as F

df.withColumn("root_path", F.regexp_extract(F.col("path"), r"^((/\w*){3})", 1))\
    .show(truncate=False)
Output:
+-----------------------------------------+-------+-----------------------+
|path |content|root_path |
+-----------------------------------------+-------+-----------------------+
|/root/path/main_folder1/folder1/path1.txt|val 1 |/root/path/main_folder1|
|/root/path/main_folder1/folder2/path2.txt|val 2 |/root/path/main_folder1|
|/root/path/main_folder1/folder2/path3.txt|val 3 |/root/path/main_folder1|
+-----------------------------------------+-------+-----------------------+
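Since the regex does all the work here, it can be sanity-checked locally with Python's re module before running on the cluster (plain Python below, not Spark code):

```python
import re

paths = [
    "/root/path/main_folder1/folder1/path1.txt",
    "/root/path/main_folder1/folder2/path2.txt",
    "/root/path/main_folder1/folder2/path3.txt",
]

# (/\w*){3} matches exactly three "/segment" groups from the start of the
# string; group 1 captures the whole three-level prefix
pattern = re.compile(r"^((/\w*){3})")

roots = [pattern.match(p).group(1) for p in paths]
```

Changing the `{3}` quantifier changes how many directory levels are kept, which is how you "limit" what regexp_extract returns.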
I need VLOOKUP-like functionality in pandas.
DataFrame 2: (FEED_NAME has no duplicate rows)
+-----------+--------------------+---------------------+
| FEED_NAME | Name | Source |
+-----------+--------------------+---------------------+
| DMSN | DMSN_YYYYMMDD.txt | Main hub |
| PCSUS | PCSUS_YYYYMMDD.txt | Basement |
| DAMJ      | DAMJ_YYYYMMDD.txt  | Eiffel Tower router |
+-----------+--------------------+---------------------+
DataFrame 1:
+-------------+
| SYSTEM_NAME |
+-------------+
| DMSN |
| PCSUS |
| DAMJ |
| : |
| : |
+-------------+
DataFrame 1 contains many more rows; it is actually a column in a much larger table. I need to merge df1 with df2 to make it look like:
+-------------+--------------------+---------------------+
| SYSTEM_NAME | Name | Source |
+-------------+--------------------+---------------------+
| DMSN | DMSN_YYYYMMDD.txt | Main Hub |
| PCSUS | PCSUS_YYYYMMDD.txt | Basement |
| DAMJ | DAMJ_YYYYMMDD.txt | Eiffel Tower router |
| : | | |
| : | | |
+-------------+--------------------+---------------------+
In Excel I would just have done VLOOKUP(,,1,TRUE) and copied it across all cells.
I have tried various combinations of merge and join, but I keep getting KeyError: 'SYSTEM_NAME'.
Code:
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"})
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()
You missed the inplace=True argument in the line _df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"}); rename returns a new DataFrame by default, so the columns of _df2 were never actually changed. Try this instead:
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"}, inplace=True)
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()
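An alternative that avoids the rename entirely is to join on differently named keys with left_on/right_on (the frames below are hypothetical stand-ins for df1 and df2):

```python
import pandas as pd

df1 = pd.DataFrame({"SYSTEM_NAME": ["DMSN", "PCSUS", "DAMJ"]})
df2 = pd.DataFrame({
    "FEED_NAME": ["DMSN", "PCSUS", "DAMJ"],
    "Name": ["DMSN_YYYYMMDD.txt", "PCSUS_YYYYMMDD.txt", "DAMJ_YYYYMMDD.txt"],
    "Source": ["Main hub", "Basement", "Eiffel Tower router"],
})

# left_on/right_on joins the two key columns without renaming either frame;
# dropping FEED_NAME afterwards leaves the VLOOKUP-style result
df3 = (
    df1.merge(df2, how="left", left_on="SYSTEM_NAME", right_on="FEED_NAME")
       .drop(columns="FEED_NAME")
)
```

Like VLOOKUP, rows of df1 with no match in df2 survive the left join, just with NaN in the looked-up columns.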
I am using the Agate library to create a table.
using the command:
table = agate.Table(cpi_rows, cpi_types, cpi_titles)
Sample values are below:
cpi_rows[0]
[1.0,'Denmark','DNK',128.0,'EU',1.0,91.0,7.0,2.2,87.0,95.0,83.0,98.0,0.0,97.0,0.0,96.0,98.0,0.0,87.0,89.0,88.0,83.0,0.0,0.0,0.0]
cpi_titles
['Country Rank','Country / Territory','WB Code','IFS Code','Region','Country Rank','CPI 2013 Score', 'Surveys Used','Standard Error', '90% Confidence interval Lower', 'Upper','Scores range MIN','MAX','Data sources AFDB','BF (SGI)','BF (BTI)','IMD','ICRG','WB','WEF','WJP','EIU','GI','PERC','TI','FH']
When I run the command, I get the error:
ValueError: Column names must be strings or None.
Though all the names in cpi_titles are strings, I cannot find the cause of the error.
Just tried your code, and apart from a few corrections to names this worked without a problem. The likely cause of your error is the argument order: agate.Table takes (rows, column_names, column_types), so passing cpi_types as the second argument means your type objects are being treated as column names.
cpi_rows = [[]]
cpi_rows[0] =[1.0,'Denmark','DNK',128.0,'EU',1.0,91.0,7.0,2.2,87.0,95.0,83.0,98.0,0.0,97.0,0.0,96.0,98.0,0.0,87.0,89.0,88.0,83.0,0.0,0.0,0.0]
cpi_titles = ['Country Rank','Country / Territory','WB Code','IFS Code','Region','Country Rank','CPI 2013 Score', 'Surveys Used','Standard Error', '90% Confidence interval Lower', 'Upper','Scores range MIN','MAX','Data sources AFDB','BF (SGI)','BF (BTI)','IMD','ICRG','WB','WEF','WJP','EIU','GI','PERC','TI','FH']
table = agate.Table(cpi_rows, cpi_titles)
table.print_structure()
The output is
| column | data_type |
| ----------------------------- | --------- |
| Country Rank | Number |
| Country / Territory | Text |
| WB Code | Text |
| IFS Code | Number |
| Region | Text |
| Country Rank_2 | Number |
| CPI 2013 Score | Number |
| Surveys Used | Number |
| Standard Error | Number |
| 90% Confidence interval Lower | Number |
| Upper | Number |
| Scores range MIN | Number |
| MAX | Number |
| Data sources AFDB | Number |
| BF (SGI) | Number |
| BF (BTI) | Number |
| IMD | Number |
| ICRG | Number |
| WB | Number |
| WEF | Number |
| WJP | Number |
| EIU | Number |
| GI | Number |
| PERC | Number |
| TI | Number |
| FH | Number |
Obviously, I don't have the definition of the types you want to apply to this data. The only other thing to note is that you defined Country Rank twice in your column titles, so agate warns you about this and relabels the second one as Country Rank_2.