I've extracted information from 142 different files and stored it in a CSV file with one column that contains both numbers and text. I want to copy rows 11-145, transform them, and paste them into another file (xlsx or csv, doesn't matter). Then I want to skip the next 10 rows, copy rows 156-290, transform and paste those, and so on. I have tried the following code:
import numpy as np
overview = np.zeros((145, 135))
for i in original:
    original[i+11:i+145, 1] = overview[1, i+1:i+135]
print(overview)
The original file is the imported file, for which I used pd.read_csv.
pd.read_csv is a function that returns a dataframe.
To select specific rows from a dataframe you can use the .loc indexer:
df.loc[start:stop:step]
so it would look something like this:
df = pd.read_csv(your_file)
new_df = df.loc[11:145]  # .loc slicing is inclusive of the end label
#transform it as you please
#convert it to excel or csv
new_df.to_excel("new_file.xlsx")  # or new_df.to_csv("new_file.csv")
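For the repeating pattern (copy 135 rows, skip 10, repeat), a small loop sketch along these lines might help; the file names and the side-by-side layout of the output are assumptions, not from your post:
import pandas as pd

df = pd.read_csv("your_file.csv", header=None)

chunks = []
start = 10  # row 11 in 1-based terms is index 10
while start + 135 <= len(df):
    chunk = df.iloc[start:start + 135]  # take a block of 135 rows
    # transform the block here as you please
    chunks.append(chunk.reset_index(drop=True))
    start += 145  # 135 copied rows + 10 skipped rows

# place each block in its own column and save
pd.concat(chunks, axis=1).to_csv("new_file.csv", index=False)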
Here is my code:
import pandas as pd
df = pd.read_parquet("file.parqet", engine='pyarrow')
df_set_index = df.set_index('column1')
row_count = df.shape[0]
column_count = df.shape[1]
print(df_set_index)
print(row_count)
print(column_count)
Can I run this without reading in the parquet file each time I want to do a row count, column count, etc? It takes a while to read in the file because it's large and I already read it in once but I'm not sure how to.
pd.read_parquet reads the whole file from disk into memory each time it is called, which is naturally slow with a lot of data. Note that unlike read_csv it has no nrows or usecols arguments; column selection is done with columns=. So, you could engineer a solution like:
1.) column_count
Parquet files store their schema and row counts in the footer metadata, so you can get both counts without reading any data at all; see the pyarrow sketch after this list.
2.) row_count
cols_want = ['column1']  # put whatever column names you want here
row_count = pd.read_parquet("file.parqet", engine='pyarrow', columns=cols_want).shape[0]
-> This gives you the number of rows while reading in only that one column. Parquet is a columnar format, so loading a single column is far cheaper than loading everything (which is the reason your solution takes a while).
-> .shape returns a tuple with values (# rows, # columns), so grab the first item for the number of rows and the second for the number of columns.
3.) df.set_index('column1') returns a new DataFrame, so storing it in a variable is fine; it just doesn't help with the counts. More importantly, once you have read the file into df it stays in memory for the rest of your session, so you can call df.shape as often as you like without re-reading the file.
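If you only need the counts, you can skip pandas entirely and read just the Parquet footer metadata; a minimal sketch, assuming pyarrow is installed and using the file name from your post:
import pyarrow.parquet as pq

meta = pq.ParquetFile("file.parqet").metadata  # reads only the footer, not the data
print(meta.num_rows)     # row count
print(meta.num_columns)  # column count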
import pandas as pd
df = pd.read_excel('Live_Data.xlsm', 'Sheet1', skiprows=5, nrows=21, usecols='B:V')
print(df)
This code reads the Excel file from column 'B' to column 'V' and prints the output. I am trying to use a loop to check the Excel sheet at some time interval and print the output only if there is a change in value, say in the 'V' column's data.
You could check the file's timestamp to know whether anything changed, but it won't tell you what changed. The way I see it, you need something to compare against, so always keep two files (an auxiliary one), load both, and look for differences:
df.equals(df2)  # False means some rows have changed; use pandas to see which ones
If there are differences, back up the current file as the auxiliary one for the next time you want to compare.
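A minimal polling sketch along those lines, assuming the file and cell range from the question; the 10-second interval is an arbitrary choice:
import time
import pandas as pd

previous = pd.read_excel('Live_Data.xlsm', 'Sheet1', skiprows=5, nrows=21, usecols='B:V')
while True:
    time.sleep(10)  # check every 10 seconds
    current = pd.read_excel('Live_Data.xlsm', 'Sheet1', skiprows=5, nrows=21, usecols='B:V')
    if not current.equals(previous):  # equals() returning False means something changed
        print(current)
        previous = current  # the new snapshot becomes the next baseline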
Here's my problem, I need to compare two procmon scans which I converted into CSV files.
Both files have identical column names, but obviously the contents differ.
I need to check the "Path" (5th column) of the first file against the one in the second file, and print out the ENTIRE row of the second file into a third CSV wherever there are corresponding matches.
I've been googling for quite a while and can't seem to get this to work like I want it to, any help is appreciated!
I've tried numerous online tools and other python scripts, to no avail.
Have you tried using pandas and numpy together?
It would look something like this:
import pandas as pd
import numpy as np
#get your second file as a Dataframe, since you need the whole rows later
file2 = pd.read_csv("file2.csv")
#get the columns to compare ("Path" is the column name given in the question)
file1Column5 = pd.read_csv("file1.csv")["Path"]
file2Column5 = file2["Path"]
#add a column where rows are marked True if the values match, else False
#(this is a row-by-row comparison, so both files must have the same number of rows)
file2["ColumnsMatch"] = np.where(file1Column5 == file2Column5, True, False)
#filter rows based on that column and remove the helper column
file2 = file2[file2['ColumnsMatch']].drop(columns='ColumnsMatch')
#write to new file
file2.to_csv(r'file3.csv')
Just write your own code for such things. It's probably easier than you expect.
#!/usr/bin/env python
import pandas as pd
# read the csv files
csv1 = pd.read_csv('<first_filename>')
csv2 = pd.read_csv('<second_filename>')
# create a boolean compare series of the two files
# (compared row by row, so both files must have the same length)
iseq = csv1['Path'] == csv2['Path']
# push the rows of csv2 where the comparison is True into csv3
csv3 = csv2[iseq]
# write to a new csv file
csv3.to_csv('<new_filename>')
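Note that both answers above compare the files row by row, which only works if the scans line up one-to-one. If instead you want every row of the second file whose Path appears anywhere in the first file, a membership test is closer to that; a short sketch with the same placeholder file names:
import pandas as pd

csv1 = pd.read_csv('file1.csv')
csv2 = pd.read_csv('file2.csv')

# keep each row of csv2 whose Path value occurs anywhere in csv1
matches = csv2[csv2['Path'].isin(csv1['Path'])]
matches.to_csv('file3.csv', index=False)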
I have a list of 50 .csv files named data1.csv, data2.csv, etc. I would like to plot the first row, third column of each of these files. But first I would like to check the 50 values to ensure I'm plotting the correct thing. I have:
import glob
import pandas as pd
files = glob.glob('data*.csv')
for f in sorted(files):
    df = pd.read_csv(f)
    print(df.iloc[0,2])
The problem here is in the last line, df.iloc[0,2] prints the 3rd column of the LAST row when I want it to print the 3rd column of the FIRST row.
Essentially print(df.iloc[0,2]) prints the same values as print(df.iloc[-1,2]) and I have no idea why.
How can I check what values the first row, third column are in all of my files?
My mistake: pd.read_csv treats the first row as a header by default, but my .csv files have no headers, so we need:
df = pd.read_csv(f, header=None)
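With that fix, collecting all 50 values for checking (and later plotting) could look like this; note that sorted() orders names lexicographically, so data10.csv sorts before data2.csv:
import glob
import pandas as pd

values = []
for f in sorted(glob.glob('data*.csv')):
    df = pd.read_csv(f, header=None)  # no header row, so row 0 is real data
    values.append(df.iloc[0, 2])      # first row, third column
print(values)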
Here is the code to process and save a CSV file, along with the raw input CSV and the output CSV, using pandas on Python 2.7. I'm wondering why there is an additional column at the beginning when saving the file? Thanks.
c_a,c_b,c_c,c_d
hello,python,pandas,0.0
hi,java,pandas,1.0
ho,c++,numpy,0.0
sample = pd.read_csv('123.csv', header=None, skiprows=1,
                     dtype={0:str, 1:str, 2:str, 3:float})
sample.columns = pd.Index(data=['c_a', 'c_b', 'c_c', 'c_d'])
sample['c_d'] = sample['c_d'].astype('int64')
sample.to_csv('saved.csv')
Here is the saved file, there is an additional column at the beginning, whose values are 0, 1, 2.
cat saved.csv
,c_a,c_b,c_c,c_d
0,hello,python,pandas,0
1,hi,java,pandas,1
2,ho,c++,numpy,0
The additional column corresponds to the index of the dataframe, which is generated automatically when you read the CSV file. You can use this index to slice, select or sort your DF in an effective manner.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html
http://pandas.pydata.org/pandas-docs/stable/indexing.html
If you want to avoid writing this index, you can set index=False when you save your dataframe with DataFrame.to_csv. Also, you are stripping the header on read and re-attaching column names later, but you can simply use the CSV's own header to avoid that step.
sample = pd.read_csv('123.csv', dtype={'c_a': str, 'c_b': str, 'c_c': str, 'c_d': float})
sample.to_csv('output.csv', index= False)
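As a side note, if you already have files saved with that extra index column, you can read them back without duplicating it; a small sketch using the saved.csv from above:
import pandas as pd

df = pd.read_csv('saved.csv', index_col=0)  # treat the first column as the index, not data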
Hope it helps :)