Original file
Data source
Output
My code is as follows.
import pandas as pd

file_dest = r"C:\Users\user\Desktop\Book1.csv"
# read csv data
book = pd.read_csv(file_dest)
file_source = r"C:\Users\user\Desktop\Book2.csv"
materials = pd.read_csv(file_source)
Right_join = pd.merge(book,
                      materials,
                      on='Name',
                      how='left')
Right_join.to_csv(file_dest, index=False)
However, the output is as follows: it looks like it just copied the contents but didn't do the Vlookup-style insert of the data. I have tried it with different kinds of data, and the results are all the same (it always looks like it just copied the contents). Please help me find the bug.
Since the column names are different in each data source, you have to specify which columns to join on in the left and right dataframes. Try this:
# assuming materials is your data source with Price column
joined = book.merge(materials,
                    left_on="Custmor",
                    right_on="Name",
                    how="left")
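If the join matches nothing, every looked-up column comes back as NaN, which can look like the file was "just copied". A quick way to diagnose that is merge's indicator=True flag; here is a sketch with made-up stand-in frames (the Custmor/Name/Price columns are assumptions, substitute your own):

```python
import pandas as pd

# Hypothetical stand-ins for Book1.csv / Book2.csv
book = pd.DataFrame({"Custmor": ["Alice", "Bob", "Carol"]})
materials = pd.DataFrame({"Name": ["Alice", "Carol"], "Price": [10, 30]})

# indicator=True adds a _merge column telling you which rows found a match;
# rows marked left_only found nothing in the lookup table
joined = book.merge(materials, left_on="Custmor", right_on="Name",
                    how="left", indicator=True)
print(joined)
```

If every row comes out left_only, the key columns don't actually share values (for example because of stray whitespace or different spellings), which is the usual cause of a "copied, not joined" result.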
I have an excel file with around 50 sheets. All the data in the excel file looks like below:
I want to read the first row and all the rows after 'finish' in the first column.
I have written my script as something like this.
df = pd.read_excel('excel1_all_data.xlsx')
finish_idx = df[df.iloc[:, 0] == 'finish'].index[0]
df = df.head(1).append(df[df.index > finish_idx])
The output looks like below:
The start and finish are gone.
My question is: how can I iterate through all the sheets in a similar way and append them into one dataframe, with an extra column holding the sheet name?
The data in other sheets is similar too, but will have different dates and Names. But start and finish will still be present and we want to get everything after 'finish'.
Thank you so much for your help !!
Try this code and let me know if it works for you:
import pandas as pd

wbSheets = pd.ExcelFile("excel1_all_data.xlsx").sheet_names
frames = []
for st in wbSheets:
    df = pd.read_excel("excel1_all_data.xlsx", st)
    frames.append(df.iloc[[0]])
    frames.append(df[5:])
res = pd.concat(frames)
print(res)
The pd.ExcelFile("excel1_all_data.xlsx").sheet_names call is what gets you the sheet names you need to iterate over.
In pandas.read_excel's documentation you'll find that you can read a specific sheet of the workbook, which is what I've used in the for loop.
I don't know if concat is the best way to solve this for huge files but it worked fine on my sample.
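For the sheet-name column the question asked for, one variation is to let read_excel return every sheet at once with sheet_name=None. This is a sketch: the tiny generated demo.xlsx workbook and its 'start'/'finish' marker rows are stand-ins for your real file, so in practice you would skip the writer step and read excel1_all_data.xlsx directly:

```python
import pandas as pd

# Build a tiny two-sheet workbook so the sketch runs end to end;
# replace this with your real excel1_all_data.xlsx
with pd.ExcelWriter("demo.xlsx") as xl:
    pd.DataFrame({"A": ["head", "start", 1, 2, "finish", 3, 4]}).to_excel(
        xl, sheet_name="S1", index=False)
    pd.DataFrame({"A": ["head", "start", 5, 6, "finish", 7, 8]}).to_excel(
        xl, sheet_name="S2", index=False)

# sheet_name=None reads every sheet in one call, as a {name: DataFrame} dict
sheets = pd.read_excel("demo.xlsx", sheet_name=None)

frames = []
for name, df in sheets.items():
    finish = df[df.iloc[:, 0] == "finish"].index[0]    # locate the marker row
    part = pd.concat([df.head(1), df[df.index > finish]])
    part["sheet"] = name                               # the sheet-name column
    frames.append(part)
res = pd.concat(frames, ignore_index=True)
```

Locating 'finish' per sheet, rather than hard-coding df[5:], also keeps it working when the marker row moves between sheets.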
I only need to import a specific column, filtered by a condition (such as specific data found in that column), and I need to remove the unnecessary columns; dropping them one by one takes too much code. What specific code or syntax is applicable?
How to get a column from a pandas dataframe is answered in Read specific columns from a csv file with csv module?
To quote:
Pandas is spectacular for dealing with csv files, and the following
code would be all you need to read a csv and save an entire column
into a variable:
import pandas as pd
df = pd.read_csv(csv_file)
saved_column = df.column_name  # you can also use df['column_name']
So in your case, you just save the filtered data frame in a new variable.
This means you do newdf = data.loc[...... and then use the code snippet from above to extract the column you desire, for example newdf.continent
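If the goal is to avoid dropping columns at all, read_csv's usecols parameter keeps only the columns you name at read time. A small sketch (the column names and data here are made up for illustration):

```python
import pandas as pd
from io import StringIO

# Made-up CSV standing in for your file
csv_file = StringIO(
    "country,continent,pop\n"
    "Kenya,Africa,54\n"
    "Peru,America,33\n"
    "Ghana,Africa,32\n"
)

# usecols keeps only the columns you need, so there is nothing to drop later
df = pd.read_csv(csv_file, usecols=["continent", "pop"])

# then filter rows by a condition and pull out the column you want
africa_pop = df.loc[df["continent"] == "Africa", "pop"]
print(africa_pop.tolist())  # [54, 32]
```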
I have an excel sheet contains the data like the following.
How to handle this in python using pandas?
Basically I want to plot this data in a graph, and to find the percentage of people who registered for ANC out of the Estimated Number of Annual Pregnancies, year-wise, across the states.
Any idea would be deeply helpful.
PS: I am using the IPython notebook on Linux Mint.
I need the data to be indexed like this:
I would recommend you read in the data frame by skipping rows, then create a dictionary to rename your columns.
Something like the following:
df = pd.read_excel(path, skiprows=8)
mydict = {"Original Col1": "New Col Name1", "Original Col2": "New Col Name2"}
df = df.rename(columns=mydict)
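Note that rename needs the columns= keyword; a bare df.rename(mydict) would try to relabel the index instead. A minimal runnable check, with made-up column names:

```python
import pandas as pd

# Made-up frame standing in for the Excel data
df = pd.DataFrame({"Original Col1": [1], "Original Col2": [2]})
mydict = {"Original Col1": "New Col Name1", "Original Col2": "New Col Name2"}

# without columns=, rename would look for these labels in the index instead
df = df.rename(columns=mydict)
print(list(df.columns))  # ['New Col Name1', 'New Col Name2']
```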
I'm working with the latest version of Spark (2.1.1). I read multiple csv files into a dataframe with spark.read.csv.
After processing with this dataframe, How can I save it to output csv file with specific name.
For example, there are 100 input files (in1.csv,in2.csv,in3.csv,...in100.csv).
The rows that belong to in1.csv should be saved as in1-result.csv. The rows that belong to in2.csv should be saved as in2-result.csv and so on. (The default file name will be like part-xxxx-xxxxx which is not readable)
I have seen partitionBy(col) but look like it can just partition by column.
Another question: I want to plot my dataframe. Spark has no built-in plotting library. Many people use df.toPandas() to convert to pandas and plot it. Is there a better solution? My data is very big, so toPandas() will cause a memory error. I'm working on a server and want to save the plot as an image instead of showing it.
I suggest the below solution for writing the DataFrame into directories specific to each input file:
in a loop, for each file:
read the csv file
add a new column with information about the input file using the withColumn transformation
union all DataFrames using the union transformation
do the required preprocessing
save the result using partitionBy, providing the column with the input-file information, so that rows related to the same input file are saved in the same output directory
Code could look like:
from pyspark.sql.functions import lit

all_df = None
for file in files:  # where files is the list of input CSV files you want to read
    df = spark.read.csv(file)
    df = df.withColumn("input_file", lit(file))  # lit() wraps the literal value
    if all_df is None:
        all_df = df
    else:
        all_df = all_df.union(df)

# do preprocessing on all_df to produce result
result.write.partitionBy("input_file").csv(outdir)
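For the plotting part of the question, a common approach is to shrink the data in Spark first and convert only the small result, then save the figure with a non-interactive matplotlib backend so nothing tries to open a window on the server. This is a sketch: save_plot, the column arguments, and the sampling fraction are all placeholders, not part of any Spark API:

```python
import matplotlib
matplotlib.use("Agg")               # headless backend: write files, never show
import matplotlib.pyplot as plt


def save_plot(spark_df, x, y, path, fraction=0.01):
    # Downsample in Spark so only a small frame crosses to pandas,
    # avoiding the toPandas() memory error on the full dataset
    pdf = spark_df.sample(False, fraction).toPandas()
    pdf.plot(x=x, y=y, kind="scatter")
    plt.savefig(path)
```

An aggregation such as spark_df.groupBy(...).count() before toPandas() works the same way and is often more informative than random sampling.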
Imagine I am given two columns: a,b,c,d,e,f and e,f,g,h,i,j (commas indicating a new row in the column).
How could I read such information from Excel and put it into two separate arrays? I would like to manipulate this info and read it off later as part of an output.
You have a few choices here. If your data is rectangular, starts from A1, etc. you should just use pandas.read_excel:
import pandas as pd
df = pd.read_excel("/path/to/excel/file", sheet_name="My Sheet Name")
print(df["column1"].values)
print(df["column2"].values)
If your data is a little messier and the read_excel options aren't enough to get at it, then I think your only choice will be to use something a little lower-level, like the fantastic xlrd module (read the quickstart in the README).