Removing the index when appending data and rewriting CSV using pandas [duplicate] - python

This question already has answers here:
How to get rid of "Unnamed: 0" column in a pandas DataFrame read in from CSV file?
(11 answers)
Closed 1 year ago.
I have a script that runs on a daily basis to collect data.
I record this data in a CSV file using the following code:
old_df = pd.read_csv('/Users/tdonov/Desktop/Python/Realestate Scraper/master_data_for_realestate.csv')
old_df = old_df.append(dataframe_for_cvs, ignore_index=True)
old_df.to_csv('/Users/tdonov/Desktop/Python/Realestate Scraper/master_data_for_realestate.csv')
I am using append(ignore_index=True), but after every run of the code I still get additional columns created at the start of my CSV. I delete them manually, but is there a way to stop them from the code itself? I looked at the function, but I am still not sure whether it is possible.
My result file gets the following columns added, one per run:
This is really annoying to have to delete every time.
Update: the data looks like this:
However, the id is not unique: it can be repeated on different days. It is the id of an online offer, and an offer can be available for one day, a couple of days, or five months.

Did you try
to_csv(index=False)
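A minimal sketch of that fix applied to the code above, keeping the question's path and dataframe_for_cvs (assumed to be built earlier by the scraper). Note that DataFrame.append was removed in pandas 2.0, where pd.concat is the replacement:

import pandas as pd

path = '/Users/tdonov/Desktop/Python/Realestate Scraper/master_data_for_realestate.csv'

old_df = pd.read_csv(path)
# pd.concat replaces the deprecated DataFrame.append (removed in pandas 2.0)
old_df = pd.concat([old_df, dataframe_for_cvs], ignore_index=True)
# index=False stops pandas from writing the index as an extra unnamed column
old_df.to_csv(path, index=False)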

Related

Why does looping over tweepy data seem to delete the data? [duplicate]

This question already has answers here:
Resetting generator object in Python
(19 answers)
Closed 7 months ago.
I'm using the Twitter API (with Tweepy) to extract a number of tweets via Python.
I'm looping over the results of:
tweets = tweepy.Cursor(api.search,
                       q=search_term,
                       since=str(t2)).items(10)
After I get the tweets, I run through a loop that puts the data into a dataframe:
However, when I run the loop again, the data seems to have disappeared:
Is there something I could be doing differently? My purpose is to continue adding columns to the dataframe from the same tweet data, but since the data appears to disappear after the first loop, I can't get it done.
Thanks in advance.
The items method of tweepy.Cursor returns an iterator, not a list, and an iterator is exhausted after a single pass; see the documentation.
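A minimal sketch of the fix, reusing the question's variables: materialize the iterator into a list once, then loop over the list as many times as needed.

tweets = tweepy.Cursor(api.search,
                       q=search_term,
                       since=str(t2)).items(10)

# items() yields an iterator, which is exhausted after one pass;
# converting it to a list stores the tweets so they can be reused
tweets = list(tweets)

for tweet in tweets:  # first pass: build the dataframe
    ...
for tweet in tweets:  # second pass sees the same data
    ...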

Filtering data on multiple csv files - python - pandas

This is my first question on Stack Overflow. I started learning Python two months ago.
I have looked at this site and others, but I can't find a solution to my problem.
I'm trying to speed up an annoying data-filtering task I have to do every time for my job.
I want to use the pandas library to read multiple .csv files (12, to be precise) and assign each one to a variable (df_1, df_2, ..., df_12) corresponding to a new, filtered dataframe.
Each .csv file contains the raw data of a tensile test from one of the company's Instron machines in the lab.
Example:
first .csv file with raw data, first 9 rows
I will use the filtered dataframe to do some other analysis with Minitab software.
This is what I managed to do so far:
import pandas as pd
dataset_1 = pd.read_csv('Specimen_RawData_1.csv')
df_1 = pd.DataFrame({'X': dataset_1.iloc[1:, -1].values,
                     'y': dataset_1.iloc[1:, 2].values})
df_1 = df_1.loc[df_1['X'].isin(['1.0', '2.0', '3.0', '4.0', '5.0'])]
The code takes the last column and assigns it to X, and takes the third column and assigns it to y.
X is then filtered to keep only the values equal to 1, 2, 3, 4, 5.
This works for the first .csv file. I could copy and paste it 12 times, but I thought that using a list or a dictionary might help instead.
I understand I can't create variables in a loop.
I have failed so far because the dictionary I created stores the names as strings, so I can't use them for data analysis.
Any idea, please?
Thank You
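A sketch of the dictionary idea, assuming the 12 files follow the naming pattern Specimen_RawData_1.csv through Specimen_RawData_12.csv (the pattern beyond the first file is an assumption):

import pandas as pd

dfs = {}
for i in range(1, 13):
    dataset = pd.read_csv(f'Specimen_RawData_{i}.csv')  # assumed naming pattern
    df = pd.DataFrame({'X': dataset.iloc[1:, -1].values,
                       'y': dataset.iloc[1:, 2].values})
    dfs[i] = df.loc[df['X'].isin(['1.0', '2.0', '3.0', '4.0', '5.0'])]

The values dfs[1] through dfs[12] are real DataFrames rather than strings, so they can be analysed or exported for Minitab directly, e.g. dfs[3].describe().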

Is there a way to loop through rows of data in excel with python until an empty cell is reached? [duplicate]

This question already has answers here:
How to find the last row in a column using openpyxl normal workbook?
(4 answers)
Closed 3 years ago.
I am working with a large Excel chart. For each row of data I need to perform several tasks. Is there a way to construct a loop in Python to run through each line until an empty cell is found?
For example:
Project1 Data Data Data
Project2 Data Data Data
Project3 Data Data Data
Project4 Data Data Data
In this scenario, I would want to run through the chart until after Project4. But different documents will have various sized charts so it will need to run until it hits an empty cell, not limited by a specific cell.
I am thinking a Do until (as you can tell I don't know python very well) type loop would be useful. I also know there is a way to attempt empty cells via openpyxl which I am using for this project.:
if sheet.cell(0, 0).value == xlrd.empty_cell.value:
    # Do something
Currently, I would try to figure out a way to do something similar to this, unless someone suggests a better alternative:
for i in range(10, 1000):  # setting an upper limit of 1000 rows
    if sheet.cell(0, i).value != xlrd.empty_cell.value:
        variable = sheet.cell(2, i).value
        # other body stuff
    else:
        break
I know this code is rather undeveloped; I just wanted to ask before going in the wrong direction. I am also unsure how to assign i to run through the rows.
If what you need is to read the Excel file in Python, I'd recommend taking a look at pandas read_excel.
Hope this helps!
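A minimal sketch of that approach, assuming a hypothetical workbook name projects.xlsx with the project names in the first column:

import pandas as pd

df = pd.read_excel('projects.xlsx', header=None)  # hypothetical filename

for _, row in df.iterrows():
    if pd.isna(row[0]):  # stop at the first empty cell in the first column
        break
    project = row[0]
    # per-row tasks go here

Since read_excel loads the whole sheet into a DataFrame, the loop never needs a hard upper limit like 1000 rows.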

Limit max number of columns displayed from a data frame in Pandas [duplicate]

This question already has answers here:
Selecting pandas column by location
(7 answers)
Closed 3 years ago.
I am trying to display 4 data frames from a list in a web-scraper project I'm working on. I'm new to Python/pandas and am trying to write a for loop to do this. My thinking is that if I can set the display restrictions, they will apply to each data frame in my list (if it means anything, I'm working out of a Jupyter Notebook). The only thing is that I need to limit the number of columns shown, not rows, to only the first 5 (columns 0-4). I'm at a loss on how to do this.
I've set up the initial loop as seen below, and I'm able to display each of my data frames correctly, just not limited to the columns I want. I would also like to figure out how to add a header to each, like a chart title in Excel, but that's a little less urgent at the moment.
Players = [MJ, KB, LJ, SC]
for players in Players:
    display(players)
Additional information: each data frame has 11 columns, and each is stored in the corresponding variable in the list above.
Check out this link:
https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html
There is an option that you can set using:
import pandas as pd
pd.set_option("display.max_columns", 5)
After doing so, display(players) will truncate the output to at most 5 columns, hiding the rest behind an ellipsis.
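If the goal is specifically the first five columns (0-4) rather than a display cap, slicing by position also works; a minimal sketch using iloc:

for players in Players:
    display(players.iloc[:, :5])  # keep only columns 0-4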

remove for loop for df.drop [duplicate]

This question already has answers here:
How to exclude multiple columns in Spark dataframe in Python
(4 answers)
Closed 4 years ago.
I am working with pyspark 2.0.
My code is :
for col in to_exclude:
    df = df.drop(col)
I cannot directly do df = df.drop(*to_exclude) because in 2.0 the drop method accepts only one column at a time.
Is there a way to change my code and remove the for loop ?
First of all, worry not: even if you do it in a loop, Spark does not execute a separate query for each drop. Queries are lazy, so it builds one big execution plan first and then executes everything at once (but you probably know that anyway).
However, if you still want to get rid of the loop within the 2.0 API, I'd go with the opposite of what you've implemented: instead of dropping columns, select only the ones you need:
df.select([col for col in df.columns if col not in to_exclude])
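A minimal, self-contained sketch of the select approach (the example frame and its column names are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 2, 3)], ['a', 'b', 'c'])  # hypothetical columns
to_exclude = ['b', 'c']

df = df.select([c for c in df.columns if c not in to_exclude])
df.show()  # only column 'a' remains

Using c instead of col as the loop variable also avoids shadowing pyspark.sql.functions.col if it is imported elsewhere.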
