I'm trying to concatenate two data frames and write the result to an Excel file. The concatenation is performed somewhat successfully, but I'm having a difficult time eliminating the index row that also gets appended.
I would appreciate it if someone could highlight what I'm doing wrong. I thought providing the "index = False" argument at every Excel call would eliminate the issue, but it has not.
Hopefully you can see the image; if not, please let me know.
# filenames
file_name = "C:\\Users\\ga395e\\Desktop\\TEST_FILE.xlsx"
file_name2 = "C:\\Users\\ga395e\\Desktop\\TEST_FILE_2.xlsx"
#create data frames
df = pd.read_excel(file_name, index = False)
df2 = pd.read_excel(file_name2,index =False)
#filter frame
df3 = df2[['WDDT', 'Part Name', 'Remove SN']]
#concatenate values
df4 = df3['WDDT'].map(str) + '-' +df3['Part Name'].map(str) + '-' + 'SN:'+ df3['Remove SN'].map(str)
test=pd.DataFrame(df4)
test=test.transpose()
df = pd.concat([df, test], axis=1)
df.to_excel("C:\\Users\\ga395e\\Desktop\\c.xlsx", index=False)
Thanks
So, as the other users also wrote, I don't see the index in your image either. If the index were being written, you would have an output like the following:
| Index | Column1 | Column2 |
|-------+----------+----------|
| 0 | Entry1_1 | Entry1_2 |
| 1 | Entry2_1 | Entry2_2 |
| 2 | Entry3_1 | Entry3_2 |
If you pass the index=False option, the index will be removed:
| Column1 | Column2 |
|----------+----------|
| Entry1_1 | Entry1_2 |
| Entry2_1 | Entry2_2 |
| Entry3_1 | Entry3_2 |
which looks like your case. Your problem could be related to the concatenation and the transposed matrix.
Did you check your temporary dataframe before exporting it?
You might want to check whether pandas imports the time column as a time index.
If you want to delete those time columns, you could use df.drop and pass a list of columns into this function, e.g. df.drop(columns=df.columns[:3]). Does this maybe solve your problem?
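For reference, a minimal sketch of that flow, with a check of the temporary frame before exporting (paths and column names are copied from the question; the drop line is just an example of removing an unwanted leading column, not a confirmed fix):
import pandas as pd

df = pd.read_excel("C:\\Users\\ga395e\\Desktop\\TEST_FILE.xlsx")
df2 = pd.read_excel("C:\\Users\\ga395e\\Desktop\\TEST_FILE_2.xlsx")

df3 = df2[['WDDT', 'Part Name', 'Remove SN']]
df4 = df3['WDDT'].map(str) + '-' + df3['Part Name'].map(str) + '-' + 'SN:' + df3['Remove SN'].map(str)

test = pd.DataFrame(df4).transpose()
print(test.head())                      # inspect the temporary frame before concatenating

out = pd.concat([df, test], axis=1)
# out = out.drop(columns=out.columns[:1])   # drop a stray index/time column if one shows up
out.to_excel("C:\\Users\\ga395e\\Desktop\\c.xlsx", index=False)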
For each set of coordinates in a pyspark dataframe, I need to find the closest set of coordinates in another dataframe
I have one pyspark dataframe with coordinate data like so (dataframe a):
+------------------+-------------------+
| latitude_deg| longitude_deg|
+------------------+-------------------+
| 40.07080078125| -74.93360137939453|
| 38.704022| -101.473911|
| 59.94919968| -151.695999146|
| 34.86479949951172| -86.77030181884766|
| 35.6087| -91.254898|
| 34.9428028| -97.8180194|
And another like so (dataframe b; only a few rows are shown for clarity):
+-----+------------------+-------------------+
|ident| latitude_deg| longitude_deg|
+-----+------------------+-------------------+
| 00A| 30.07080078125| -24.93360137939453|
| 00AA| 56.704022| -120.473911|
| 00AK| 18.94919968| -109.695999146|
| 00AL| 76.86479949951172| -67.77030181884766|
| 00AR| 10.6087| -87.254898|
| 00AS| 23.9428028| -10.8180194|
Is it possible to somehow merge the dataframes so that each row in dataframe a gets the closest ident from dataframe b:
+------------------+-------------------+-------------+
| latitude_deg| longitude_deg|closest_ident|
+------------------+-------------------+-------------+
| 40.07080078125| -74.93360137939453| 12A|
| 38.704022| -101.473911| 14BC|
| 59.94919968| -151.695999146| 278A|
| 34.86479949951172| -86.77030181884766| 56GH|
| 35.6087| -91.254898| 09HJ|
| 34.9428028| -97.8180194| 09BV|
What I have tried so far:
I have a pyspark UDF defined that calculates the haversine distance between two pairs of coordinates.
udf_get_distance = F.udf(get_distance)
It works like this:
df = (df.withColumn("ABS_DISTANCE", udf_get_distance(
    df.latitude_deg_a, df.longitude_deg_a,
    df.latitude_deg_b, df.longitude_deg_b)
))
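get_distance itself is a standard haversine calculation, roughly along these lines (the exact implementation isn't shown here, so treat this as a sketch):
from math import asin, cos, radians, sin, sqrt

def get_distance(lat_a, lon_a, lat_b, lon_b):
    # haversine distance in kilometres between two (lat, lon) pairs
    lat_a, lon_a, lat_b, lon_b = map(radians, (float(lat_a), float(lon_a),
                                               float(lat_b), float(lon_b)))
    dlat = lat_b - lat_a
    dlon = lon_b - lon_a
    a = sin(dlat / 2) ** 2 + cos(lat_a) * cos(lat_b) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))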
I'd appreciate any kind of help. Thanks so much
You need to do a crossJoin first, something like this:
joined_df=source_df1.crossJoin(source_df2)
Then you can call your UDF as you mentioned, generate a row number based on the distance, and keep the closest row:
from pyspark.sql import Window
from pyspark.sql.functions import row_number

rwindow = Window.partitionBy("latitude_deg_a", "longitude_deg_a").orderBy("ABS_DISTANCE")

udf_result_df = (joined_df
    .withColumn("ABS_DISTANCE", udf_get_distance(
        joined_df.latitude_deg_a, joined_df.longitude_deg_a,
        joined_df.latitude_deg_b, joined_df.longitude_deg_b))
    .withColumn("rownum", row_number().over(rwindow))
    .filter("rownum = 1"))
Note: add a return type to your UDF.
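Putting the pieces together, a sketch of the full pipeline might look like this (the column aliases, the DoubleType return type and the final select are assumptions; get_distance is your haversine function):
from pyspark.sql import Window, functions as F
from pyspark.sql.functions import row_number
from pyspark.sql.types import DoubleType

udf_get_distance = F.udf(get_distance, DoubleType())   # explicit return type so the ordering is numeric

# alias the coordinate columns so they don't collide after the cross join
a = source_df1.select(
    F.col("latitude_deg").alias("latitude_deg_a"),
    F.col("longitude_deg").alias("longitude_deg_a"))
b = source_df2.select(
    F.col("ident"),
    F.col("latitude_deg").alias("latitude_deg_b"),
    F.col("longitude_deg").alias("longitude_deg_b"))

joined_df = a.crossJoin(b)

rwindow = Window.partitionBy("latitude_deg_a", "longitude_deg_a").orderBy("ABS_DISTANCE")

result = (joined_df
    .withColumn("ABS_DISTANCE", udf_get_distance(
        F.col("latitude_deg_a"), F.col("longitude_deg_a"),
        F.col("latitude_deg_b"), F.col("longitude_deg_b")))
    .withColumn("rownum", row_number().over(rwindow))
    .filter("rownum = 1")
    .select("latitude_deg_a", "longitude_deg_a", F.col("ident").alias("closest_ident")))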
I'm getting (via a Python API) a .csv file from an email attachment I received in Gmail, transforming it into a dataframe to do some data prep, and saving it as .csv on my PC. It works great; the problem is that I get '\n' in some columns (it comes like that from the source attachment).
The code that I used to get the data and transform it into a dataframe and .csv:
r = io.BytesIO(part.get_payload(decode = True))
df = pd.DataFrame(r)
df.to_csv('C:/Users/x.csv', index = False)
Example of the df that I get:
+-------------+----------+---------+----------------------+
| Information | Modified | Created | MD_x0020_Agenda\r\n' |
+-------------+----------+---------+----------------------+
| c | d | f | \r\n' |
| b\n' | | | |
| c | e | \r\n' | |
+-------------+----------+---------+----------------------+
Example of the correct result:
+-------------+----------+---------+----------------------+
| Information | Modified | Created | MD_x0020_Agenda\r\n' |
+-------------+----------+---------+----------------------+
| c | d | f | \r\n' |
| b | c | e | \r\n' |
+-------------+----------+---------+----------------------+
I tried to use line_terminator. In my mind, if I forced it to use only \r\n and not \n, it would work. It didn't.
df.to_csv('C:/Users/x.csv', index = False, line_terminator='\r\n')
Can somebody help me with that? It's really freaking me out; because of it I can't advance my project. Thanks.
Usually, this "\n" marks a line break, i.e. the 'return' key.
You can get rid of it by applying replace with regex=True on your dataframe:
df = df.replace('\n', '', regex=True)
For more details on the function, see the pandas DataFrame.replace documentation.
Hope it works.
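For example, a quick check on a toy frame (the column name is just for illustration):
import pandas as pd

df = pd.DataFrame({'Information': ['b\n', 'c']})
df = df.replace('\n', '', regex=True)    # regex=True makes it a substring replacement
print(df['Information'].tolist())        # ['b', 'c']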
I mixed the two answers and got the solution, thanks!
PS: with some research I found that this is a Windows/Excel issue: when you export a .csv, Excel treats \n and \r\n (\r too?) as a new row, while a DataFrame by default treats only \r\n as a new row.
df = pd.read_csv(io.BytesIO(part.get_payload(decode = True)), header=None)
#grab the first row for the header
new_header = df.iloc[0]
#take the data less the header row
df = df[1:]
#set the header row as the df header
df.columns = new_header
#replace the \n which is creating new lines
df['Information'] = df['Information'].replace(regex='\n', value='')
df.to_csv('C:/Users/x.csv', index=False)
I have two pyspark dataframes:
1st dataframe: plants
+-----+--------+
|plant|station |
+-----+--------+
|Kech | st1 |
|Casa | st2 |
+-----+--------+
2nd dataframe: stations
+-------+--------+
|program|station |
+-------+--------+
|pr1 | null|
|pr2 | st1 |
+-------+--------+
What I want is to replace the null values in the second dataframe stations with all of the station values from the first dataframe, like this:
+-------+--------------+
|program|station |
+-------+--------------+
|pr1 | [st1, st2]|
|pr2 | st1 |
+-------+--------------+
I did this:
stList = plants.select(F.col('station')).rdd.map(lambda x: x[0]).collect()
stations = stations.select(
F.col('program'),
F.when(stations.station.isNull(), stList).otherwise(stations.station).alias('station')
)
but it gives me an error because when doesn't accept a Python list as a parameter.
Thanks for your replies.
I've found the solution by converting the column to pandas.
stList = list(plants.select(F.col('station')).toPandas()['station'])
and then use:
F.when(stations.station.isNull(), F.array([F.lit(x) for x in stList])).otherwise(stations['station']).alias('station')
It gives an array directly.
A quick workaround is:
F.lit(str(stList))
This should work.
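Note that this puts the Python list's string representation into the column rather than a real array; a minimal illustration of the difference:
from pyspark.sql import functions as F

stList = ['st1', 'st2']                  # example value, as in the question
F.lit(str(stList))                       # a literal string column: "['st1', 'st2']"
F.array([F.lit(x) for x in stList])      # a real array column: [st1, st2]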
For better type casting, use the code below:
stations = stations.select(
F.col('program'),
F.when(stations.station.isNull(), F.array([F.lit(x) for x in stList]))
.otherwise(F.array(stations.station)).alias('station')
)
Firstly, you can't keep different datatypes in the station column; it needs to be consistent.
+-------+--------------+
|program|station |
+-------+--------------+
|pr1 | [st1, st2]| # this is array
|pr2 | st1 | # this is string
+-------+--------------+
Secondly, this should do the trick:
from pyspark.sql import functions as F
# Create the stList as a string.
stList = ",".join(plants.select(F.col('station')).rdd.map(lambda x: x[0]).collect())
# coalesce the variables and then apply pyspark.sql.functions.split function
stations = (stations.select(
F.col('program'),
F.split(F.coalesce(stations.station, F.lit(stList)), ",").alias('station')))
stations.show()
Output:
+-------+----------+
|program| station|
+-------+----------+
| pr1|[st1, st2]|
| pr2| [st1]|
+-------+----------+
I have two columns in a dataframe. Column one is named previous_code and column two is named New_code. These columns have values such as "PO", "GO", "RO", etc. The codes have a priority; for example, "PO" has a higher priority than "GO". I want to compare the values of these two columns and put the result in a new column as "High", "Low", or "No Change" (the latter in case both columns have the same code). Below is an example of how the dataframe looks:
CustID | previous_code | New_code
345    | PO            | GO
367    | RO            | PO
385    | PO            | RO
455    | GO            | GO
Expected output dataframe:
CustID | previous_code | New_code | Change
345    | PO            | GO       | Low
367    | RO            | PO       | High
385    | PO            | RO       | Low
455    | GO            | GO       | No Change
If someone could write demo code for this in PySpark or pandas, that would be helpful.
Thanks in advance.
If I understood the ordering correctly, this should work fine:
import pandas as pd
import numpy as np
data = {'CustID':[345,367,385,455],'previous_code':['PO','RO','PO','GO'],'New_code':['GO','PO','RO','GO']}
df = pd.DataFrame(data)
mapping = {'PO':1,'GO':2,'RO':3}
df['previous_aux'] = df['previous_code'].map(mapping)
df['new_aux'] = df['New_code'].map(mapping)
df['output'] = np.where(df['previous_aux'] == df['new_aux'],'No change',np.where(df['previous_aux'] > df['new_aux'],'High','Low'))
df = df[['CustID','previous_code','New_code','output']]
print(df)
Output:
CustID previous_code New_code output
0 345 PO GO Low
1 367 RO PO High
2 385 PO RO Low
3 455 GO GO No change
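If you need the same logic in PySpark instead (the question asks for either), a sketch along these lines should also work (the createDataFrame step is only there to make the example self-contained):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

data = [(345, 'PO', 'GO'), (367, 'RO', 'PO'), (385, 'PO', 'RO'), (455, 'GO', 'GO')]
sdf = spark.createDataFrame(data, ['CustID', 'previous_code', 'New_code'])

# same priority mapping as in the pandas answer: PO=1, GO=2, RO=3
priority = F.create_map([F.lit(x) for x in ('PO', 1, 'GO', 2, 'RO', 3)])

sdf = (sdf
    .withColumn('prev_aux', priority[F.col('previous_code')])
    .withColumn('new_aux', priority[F.col('New_code')])
    .withColumn('Change',
        F.when(F.col('prev_aux') == F.col('new_aux'), 'No Change')
         .when(F.col('prev_aux') > F.col('new_aux'), 'High')
         .otherwise('Low'))
    .drop('prev_aux', 'new_aux'))

sdf.show()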
I am trying to put together a usable set of data about glaciers. Our original data comes from an ArcGIS dataset, and latitude/longitude values were stored in a separate file, now detached from the CSV with all of our data. I am attempting to merge the latitude/longitude file with our data set. Here's a preview of what the files look like.
This is my main dataset file, glims (columns dropped for clarity)
| ANLYS_ID | GLAC_ID | AREA |
|----------|----------------|-------|
| 101215 | G286929E46788S | 2.401 |
| 101146 | G286929E46788S | 1.318 |
| 101162 | G286929E46788S | 0.061 |
This is the latitude-longitude file, coordinates
| lat | long | glacier_id |
|-------|---------|----------------|
| 1.187 | -70.166 | G001187E70166S |
| 2.050 | -70.629 | G002050E70629S |
| 3.299 | -54.407 | G002939E70509S |
The problem is, the coordinates data frame has one row for each glacier id with latitude longitude, whereas my glims data frame has multiple rows for each glacier id with varying data for each entry.
I need every single entry in my main data file to have a latitude-longitude value added to it, based on the matching glacier_id between the two data frames.
Here's what I've tried so far.
glims = pd.read_csv('glims_clean.csv')
coordinates = pd.read_csv('LatLong_GLIMS.csv')
df['que'] = np.where((coordinates['glacier_id'] ==
glims['GLAC_ID']))
error returns: 'int' object is not subscriptable
and:
glims.merge(coordinates, how='right', on=('glacier_id', 'GLAC_ID'))
error returns: 'int' object has no attribute 'merge'
I have no idea how to tackle this big of a merge. I am also afraid of making mistakes because it is nearly impossible to catch them, since the data carries no other identifying factors.
Any guidance would be awesome, thank you.
This should work
glims = glims.merge(coordinates, how='left', left_on='GLAC_ID', right_on='glacier_id')
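A quick sanity check after the merge (reusing the names above); rows whose GLAC_ID has no match in the coordinates file will simply have NaN for lat/long:
print(glims[['GLAC_ID', 'lat', 'long']].head())            # coordinates now sit next to each analysis row
print(glims['lat'].isna().sum(), "rows have no coordinate match")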
This is a classic merging problem. One way to solve it is using straight loc and index-matching:
glims = glims.set_index('GLAC_ID')
glims.loc[:, 'lat'] = coord.set_index('glacier_id').lat
glims.loc[:, 'long'] = coord.set_index('glacier_id').long
glims = glims.reset_index()
You can also use pd.merge
pd.merge(glims,
coord.rename(columns={'glacier_id': 'GLAC_ID'}),
on='GLAC_ID')
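One detail worth noting: pd.merge defaults to an inner join, so glims rows without matching coordinates would be dropped. If every entry must be kept, pass how='left' explicitly (a sketch reusing the names above):
import pandas as pd

merged = pd.merge(
    glims,
    coord.rename(columns={'glacier_id': 'GLAC_ID'}),
    on='GLAC_ID',
    how='left')      # keep every glims row; unmatched IDs get NaN lat/long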