So I have an Excel file with data like:
+---+--------+----------+----------+----------+----------+---------+
| | A | B | C | D | E | F |
+---+--------+----------+----------+----------+----------+---------+
| 1 | Name | 266 | | | | |
| 2 | A | B | C | D | E | F |
| 3 | 0.1744 | 0.648935 | 0.947621 | 0.121012 | 0.929895 | 0.03959 |
+---+--------+----------+----------+----------+----------+---------+
My main labels are on row 2, but I need to delete the first row. I am using the following pandas code:
import pandas as pd
excel_file = 'Data.xlsx'
c1 = pd.read_excel(excel_file)
How do I make the 2nd row my main label row?
You can use the skiprows parameter to skip the top row. You can also read more about the parameters available for read_excel in the pandas documentation.
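For example, a minimal sketch assuming the layout shown above, where the real labels sit on the second spreadsheet row:

import pandas as pd

excel_file = 'Data.xlsx'
# skip the first spreadsheet row so the second row becomes the header
c1 = pd.read_excel(excel_file, skiprows=1)
# equivalently, header=1 tells pandas which row holds the column labels
c1 = pd.read_excel(excel_file, header=1)

Either call leaves the numeric row as the only data row, with A..F as the column labels.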
So I currently have a large CSV containing data for a number of events.
Column one contains a number of dates as well as some IDs for each event, for example.
Basically, I want to write something in Python so that whenever there is an ID number (AL.....) it creates a new CSV with the ID number as the title, containing all the data before the next ID number, so I end up with a CSV for each event.
For info, the whole CSV contains 8 columns, but the division into individual CSVs is only predicated on column one.
Use Python to split a CSV file with multiple headers
I notice this question is quite similar, but in my case I have AL followed by a different string of numbers each time, and I also want to name the new CSVs after the ID numbers.
You can achieve this using pandas, so let's first generate some data:
import pandas as pd
import numpy as np
def date_string():
    return str(np.random.randint(1, 32)) + "/" + str(np.random.randint(1, 13)) + "/1997"

l = [date_string() for x in range(20)]
l[0] = "AL123"
l[10] = "AL321"
df = pd.DataFrame(l, columns=['idx'])
# -->
| | idx |
|---:|:-----------|
| 0 | AL123 |
| 1 | 24/3/1997 |
| 2 | 8/6/1997 |
| 3 | 6/9/1997 |
| 4 | 31/12/1997 |
| 5 | 11/6/1997 |
| 6 | 2/3/1997 |
| 7 | 31/8/1997 |
| 8 | 21/5/1997 |
| 9 | 30/1/1997 |
| 10 | AL321 |
| 11 | 8/4/1997 |
| 12 | 21/7/1997 |
| 13 | 9/10/1997 |
| 14 | 31/12/1997 |
| 15 | 15/2/1997 |
| 16 | 21/2/1997 |
| 17 | 3/3/1997 |
| 18 | 16/12/1997 |
| 19 | 16/2/1997 |
So the interesting positions are 0 and 10, as those are where the AL* strings sit.
Now, to filter on the AL* values you can use:
idx = df.index[df['idx'].str.startswith('AL')]  # gets all indices where the value starts with AL
dfs = np.split(df, idx)  # splits the data at those positions
for out in dfs[1:]:
    name = out.iloc[0, 0]
    out.to_csv(name + ".csv", index=False, header=False)  # saves the data
This gives you two CSV files named AL123.csv and AL321.csv, with the first line of each being the AL* string.
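To apply the same steps to your actual file, here is a sketch under the assumption that the IDs sit in the first of your 8 columns and the file has no separate header row (events.csv is a placeholder name):

import numpy as np
import pandas as pd

df = pd.read_csv("events.csv", header=None)  # placeholder filename

# positions of the rows whose first column starts with 'AL'
idx = df.index[df[0].astype(str).str.startswith('AL')]

# split at those positions and write one csv per event, named after its id
for out in np.split(df, idx)[1:]:
    name = out.iloc[0, 0]
    out.to_csv(str(name) + ".csv", index=False, header=False)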
I have a folder with multiple CSV files. Each CSV file has the same dimensions. They all have 2 columns, and the first column of each is the same. Is there a way to import all the CSVs and concatenate them into one DataFrame, in which the first file provides the first column along with its second column, and the subsequent files just have their second column of values added next to that? The header of the second column is unique for each file, but the first column has the same header in every file.
This will give you a combination of all your files in the path folder.
You can find all material related to merging or combining DataFrames here.
Check it out for all sorts of ways to combine DataFrames (the CSVs that you read in as DataFrames).
import pandas as pd
import os

path = 'path to folder'
all_files = os.listdir(path)
li = []
for filename in all_files:
    df = pd.read_csv(os.path.join(path, filename), index_col='H1')
    print(df)
    li.append(df)
frame = pd.concat(li, axis=1, ignore_index=False)
frame.to_csv(os.path.join(path, 'out.csv'))
print(frame)
input files are like:
File1
+----+----+
| H1 | H2 |
+----+----+
| 1 | A |
| 2 | B |
| 3 | C |
+----+----+
File2:
+----+----+
| H1 | H2 |
+----+----+
| 1 | D |
| 2 | E |
| 3 | F |
+----+----+
File3:
+----+----+
| H1 | H2 |
+----+----+
| 1 | G |
| 2 | H |
| 3 | I |
+----+----+
Output (saved as out.csv in the same directory):
+----+----+----+----+
| H1 | H2 | H2 | H2 |
+----+----+----+----+
| 1 | A | D | G |
| 2 | B | E | H |
| 3 | C | F | I |
+----+----+----+----+
Here is how I would proceed.
I am assuming that only CSV files are present in the folder.
import os
import pandas as pd

folder = "path_of_the_folder"
files = os.listdir(folder)
# read each file, indexing it by the shared first column
dfs = [pd.read_csv(os.path.join(folder, f)).set_index('col1') for f in files]
# join the remaining frames onto the first one via the shared index
df_final = dfs[0].join(dfs[1:])
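If the folder might contain anything other than CSV files, here is a variant of the same approach that filters on the .csv extension (the folder path is a placeholder, as above):

import glob
import os
import pandas as pd

folder = "path_of_the_folder"  # placeholder path
# keep only .csv files, in a stable order
files = sorted(glob.glob(os.path.join(folder, "*.csv")))
dfs = [pd.read_csv(f).set_index('col1') for f in files]
df_final = dfs[0].join(dfs[1:])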
I have a dataframe that looks like this:
+--------------+----------+----------+
| partitionCol | orderCol | valueCol |
+--------------+----------+----------+
| A | 1 | 201 |
| A | 2 | 645 |
| A | 3 | 302 |
| B | 1 | 335 |
| B | 2 | 834 |
+--------------+----------+----------+
I want to group by partitionCol, then within each partition iterate over the rows, ordered by orderCol, and apply some function to calculate a new column based on valueCol and a cached value.
e.g.
def foo(col_value, cached_value):
    tmp = <some value based on a condition between col_value and cached_value>
    <update the cached_value using some logic>
    return tmp
I understand I need to group by partitionCol and apply a UDF that will operate on each chunk separately, but I am struggling to find a good way to iterate over the rows and apply the logic I described, to get the desired output of:
+--------------+----------+----------+---------------+
| partitionCol | orderCol | valueCol | calculatedCol |
+--------------+----------+----------+---------------+
| A | 1 | 201 | C1 |
| A | 2 | 645 | C1 |
| A | 3 | 302 | C2 |
| B | 1 | 335 | C1 |
| B | 2 | 834 | C2 |
+--------------+----------+----------+---------------+
I think the best way for you to do that is to apply a UDF on the whole set of data:
from pyspark.sql import functions as F

# first, you create a struct with the order col and the value col
df = df.withColumn("my_data", F.struct(F.col('orderCol'), F.col('valueCol')))
# then you create an array of that new column
df = df.groupBy("partitionCol").agg(F.collect_list('my_data').alias("my_data"))
# finally, you apply your function on that array
df = df.withColumn("calculatedCol", my_udf(F.col("my_data")))
But without knowing exactly what you want to do, that is all I can offer.
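For instance, here is a minimal sketch of what my_udf could look like; the cache update and the condition below are placeholders, since the question leaves them unspecified:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

@F.udf(returnType=ArrayType(StringType()))
def my_udf(rows):
    # rows is the collected list of (orderCol, valueCol) structs for one partitionCol
    ordered = sorted(rows, key=lambda r: r['orderCol'])
    cached_value = 0
    out = []
    for r in ordered:
        cached_value += r['valueCol']                       # placeholder cache update
        out.append("C1" if cached_value < 1000 else "C2")   # placeholder condition
    return out

Note that calculatedCol then holds one array per partitionCol; to get back to one value per row you would still need something like posexplode, or arrays_zip followed by explode.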
I have 2 dataframes which I need to merge based on a column (Employee code). Please note that the dataframe has about 75 columns, so I am providing a sample dataset to get some suggestions/sample solutions. I am using databricks, and the datasets are read from S3.
Following are my 2 dataframes:
DATAFRAME - 1
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | | | | | | | |
|-----------------------------------------------------------------------------------|
DATAFRAME - 2
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | | | | | C | | | | |
|B10001 | | | | | | | | |T2 |
|A10001 | | | | | | | | B | |
|A10001 | | | C | | | | | | |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
I need to merge the 2 dataframes based on EMP_CODE, basically join dataframe1 with dataframe2 based on emp_code. I am getting duplicate columns when I do a join, and I am looking for some help.
Expected final dataframe:
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | C | | C | | | B | |
|B10001 | | | | | | | | |T2 |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
There are 3 rows with emp_code A10001 in dataframe2, and 1 row in dataframe1. All data should be merged into one record without any duplicate columns.
Thanks much
You can use an inner join:
output = df1.join(df2, ['EMP_CODE'], how='inner')
You can also apply distinct at the end to remove duplicates:
output = df1.join(df2, ['EMP_CODE'], how='inner').distinct()
You can do that in Scala, if both dataframes have the same columns, with:
output = df1.union(df2)
First you need to aggregate the individual dataframes.
from pyspark.sql import functions as F
df1 = df1.groupBy('EMP_CODE').agg(F.concat_ws(" ", F.collect_list(df1.COLUMN1)))
You have to write this for all columns and for all dataframes.
Then you'll have to use the union function on all dataframes:
df1.union(df2)
and then repeat the same aggregation on that union dataframe.
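Here is a sketch of the same idea written as a loop over all columns, in a single pass over the union; it assumes both dataframes share the schema shown above and that the non-key columns are strings:

from pyspark.sql import functions as F

value_cols = [c for c in df1.columns if c != 'EMP_CODE']
# concatenate the collected values of each column within an EMP_CODE group
agg_exprs = [F.concat_ws(" ", F.collect_list(F.col(c))).alias(c) for c in value_cols]
result = df1.union(df2).groupBy('EMP_CODE').agg(*agg_exprs)

Here result is just a hypothetical name for the combined dataframe.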
What you need is a union.
If both dataframes have the same number of columns and the columns that are to be "union-ed" are positionally the same (as in your example), this will work:
output = df1.union(df2).dropDuplicates()
If both dataframes have the same number of columns and the columns that need to be "union-ed" have the same name (as in your example as well), this would be better:
output = df1.unionByName(df2).dropDuplicates()
I would like to take values from two CSV files and put them into a single CSV file.
Please refer to the data in those two CSV files:
CSV 1:
| | Status | P | F | B | IP | NI | NA | CO | U |
|---|----------|----|---|---|----|----|----|----|---|
| 0 | Sanity 1 | 14 | | | | | | 1 | |
| 1 | Sanity 2 | 13 | | 1 | | | | 1 | |
| | | | | | | | | | |
CSV 2:
| | Status | P | F | B | IP | NI | NA | CO | U |
|---|------------|-----|---|---|----|----|----|----|---|
| 0 | P0 Dry Run | 154 | 1 | | | 1 | | | 5 |
| | | | | | | | | | |
| | | | | | | | | | |
Code:
I tried the following code:
filenames = glob.glob("C:\\Users\\gomathis\\Downloads\\To csv\\*.csv")
wf = csv.writer(open("C:\\Users\\gomathis\\Downloads\\To csv\\FinalTR.csv", 'wb'))
for f in filenames:
    rd = csv.writer(open(f, 'r'))
    next(rd)
    for row in rd:
        wf.writerow(row)
Actual result:
When trying the above code, I didn't get the values from those CSV files.
Expected result:
I need those two files to be combined into a single CSV file and saved locally.
Modified code:
filenames = glob.glob("C:\\Users\\gomathis\\Downloads\\To csv\\*.csv")
wf = csv.writer(open("C:\\Users\\gomathis\\Downloads\\To csv\\FinalTR.csv", 'w'))
print(filenames)
for f in filenames:
    rd = csv.reader(open(f, 'r', newline=''))
    next(rd)
    for row in rd:
        wf.writerow(row)
Latest result:
I got the result below after modifying the code, but I didn't get the header row (Status, P, F, B, etc.). Please refer to the latest result.
| 0 | P0 Dry Run - 15/02/18 | 154 | 1 | | | 1 | | | 5 |
|---|--------------------------------|-----|---|---|---|---|---|---|---|
| | | | | | | | | | |
| 0 | Sanity in FRA Prod - 15/02/18 | 14 | | | | | | 1 | |
| | | | | | | | | | |
| 1 | Sanity in SYD Gamma - 15/02/18 | 13 | | 1 | | | | 1 | |
You need to call the csv.reader method on your CSV files in the loop:
rd = csv.reader(open(f, 'r'))
import csv
import glob

dest_fname = "C:\\Users\\gomathis\\Downloads\\To csv\\FinalTR.csv"
src_fnames = glob.glob("C:\\Users\\gomathis\\Downloads\\To csv\\*.csv")

with open(dest_fname, 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    copy_headers = True
    for src_fname in src_fnames:
        # don't want to overwrite destination file
        if src_fname.endswith('FinalTR.csv'):
            continue
        with open(src_fname, 'r', newline='') as f_in:
            reader = csv.reader(f_in)
            # header row is copied from first csv and skipped on the rest
            if copy_headers:
                copy_headers = False
            else:
                next(reader)  # skip header
            for row in reader:
                writer.writerow(row)
Notes:
Placed your open() in with-statements for automatic closing of files.
Removed the binary flag from file modes and added newline='' which is needed for files passed to csv.reader and csv.writer in Python 3.
Changed from csv.writer to csv.reader for the files you were reading from.
Added a copy_headers flag to enable copying headers from the first file and skipping copying of headers from any files after that.
Check source filename and skip it when it matches the destination filename.