List index out of range when specifying column in numpy - python

I have been tasked with extracting data from a specific column of a CSV file using numpy and loadtxt. The data is in column D of the attached image. By my logic I should use the numpy parameter usecols=3 to obtain only the 4th column, which is the one I want. But my output keeps telling me that the index is out of range, even though there is clearly a column there. I have done some prior searching and the general consensus seems to be that one of the rows doesn't have any data in that column. But I have checked, and all the rows have data in the column. Here is the code I'm using (highlighted in green in the image). Can anyone tell me why this is happening?
from numpy import loadtxt

data = open("suttonboningtondata_moodle.csv","r")
min_temp = loadtxt(data,usecols=(3),skiprows=5,dtype=str,delimiter=" ")
print(min_temp)

I would suggest using another library to extract your data. The pandas library works well in this regard.
Here is a documentation link to guide you.
pandas docs
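For example, a minimal sketch of reading just the 4th column with pandas (assuming the same file and that the first 5 rows are metadata, as in your loadtxt call):
import pandas as pd

# read only the 4th column (index 3), skipping the first 5 metadata rows
min_temp = pd.read_csv("suttonboningtondata_moodle.csv", usecols=[3],
                       skiprows=5, header=None, dtype=str)
print(min_temp)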

I added a comma instead of whitespace for the delimiter value and it worked. I have no idea why, though.
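Presumably the file's fields are separated by commas, so with delimiter=" " each row was split into fewer columns than expected and index 3 did not exist. The working call, for reference:
from numpy import loadtxt

# the file is comma-separated, so split on "," rather than whitespace
min_temp = loadtxt("suttonboningtondata_moodle.csv", usecols=3, skiprows=5, dtype=str, delimiter=",")
print(min_temp)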

Related

Converting 0-1 values in dataset with the name of the column if the value of the cell is 1

I have a CSV dataset with 0-1 values for the features of the elements. I want to iterate over each cell and replace every value of 1 with the name of its column. There are more than 500 thousand rows and 200 columns, and because the table is exported from another annotation tool which I update often, I want to find a way to do it automatically in Python.
Below is not the real table but a sample I was using while trying to write the code. I tried a few things, but without success.
I would really appreciate it if you could share your knowledge with me; it would be a huge help. The final result I want is of the type (abonojnë, token_pos_verb). If you know of any method to do this in Excel without the help of Python, that would be even better.
Thank you,
Brikena
Text,Comment,Role,ParentID,doc_completeness,lemma,MultiWord_Expr,token,pos,punctuation,verb,noun,adjective
abonojnë,,,,,,,1,1,0,1,0,0
çokasin,,,,,,,1,1,0,1,0,1
gërgasin,,,,,,,1,1,0,1,0,0
godasin,,,,,,,1,1,0,1,0,0
përkasin,,,,,,,1,1,1,1,0,0
përdjegin,,,,,,,1,1,0,1,0,0
lakadredhin,,,,,,,1,1,0,1,1,0
përdredhin,,,,,,,1,1,0,1,0,0
spërdredhin,,,,,,,1,1,0,1,0,0
përmbledhin,,,,,,,1,1,0,1,0,0
shpërdredhin,,,,,,,1,1,0,1,0,0
arsejnë,,,,,,,1,1,0,1,1,0
çapëlejnë,,,,,,,1,1,0,1,0,0
Using pandas, this is quite easy:
# pip install pandas
import pandas as pd
# read data (here example with csv, but use "read_excel" for excel)
df = pd.read_csv('input.csv').set_index('Text')
# reshape and export
(df.mul(df.columns).where(df.eq(1))   # keep the column name where the value is 1, NaN elsewhere
 .stack().rename('xxx')               # drop the NaNs, one row per (word, feature)
 .groupby(level=0).apply('_'.join)    # join the feature names for each word
).to_csv('output.csv') # here use "to_excel" for excel format
output file:
Text,xxx
abonojnë,token_pos_verb
arsejnë,token_pos_verb_noun
godasin,token_pos_verb
gërgasin,token_pos_verb
lakadredhin,token_pos_verb_noun
përdjegin,token_pos_verb
përdredhin,token_pos_verb
përkasin,token_pos_punctuation_verb
përmbledhin,token_pos_verb
shpërdredhin,token_pos_verb
spërdredhin,token_pos_verb
çapëlejnë,token_pos_verb
çokasin,token_pos_verb_adjective
An update for those who may find it helpful in the future. Thank you to #mozway for helping me. A friend of mine suggested working with an Excel formula, because the solution with pandas and groupby eliminates duplicates. I need all the duplicates: since it's an annotated corpus, it's normal that there are repeated words, and each should appear in every context, not only the first occurrence.
The other alternative is this:
Use a second sheet of the Excel file, writing the formula =IF(Sheet1!B2=1,Sheet2!B$1,"") in the first cell with 0-1 values and dragging it across all the other cells. This keeps all the occurrences of the words. It's quick and it works like magic.
I hope this can be helpful to others who want to convert a 0-1 dataset to feature names without having to code.
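For the record, the same "keep every occurrence" behaviour can probably also be reproduced in pandas by skipping the groupby join and replacing each cell in place, which mirrors the Excel formula above. A rough sketch, assuming the same input.csv layout as in the answer:
import pandas as pd

df = pd.read_csv('input.csv').set_index('Text')
# 1 -> column name, anything else -> empty string; one row per word, duplicates preserved
cellwise = df.mul(df.columns).where(df.eq(1), '')
cellwise.to_csv('output_cellwise.csv')
This gives one column per feature (like the Excel sheet) rather than a single joined string.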

How to transform values from dataframe(python) attributes from a row into columns?

I have the following dataframe (see the "Current Dataframe" screenshot), loaded from a CSV, that I want to use for some sampling tests.
For that I want to keep all of the current columns, but also transform Element_Count and Tag_Count by turning each of their values (e.g. link: 10) into a separate column, for Element_Count and Tag_Count separately.
I want to extract each value and turn it into a column. The final dataframe would be something like this (obviously depending on the values inside Element_Count/Tag_Count):
Index (the 0, 1, 2, etc. from the dataframe itself), PageID, Uri, A, AA, AAA, link (as a column, holding its value from Element_Count, e.g. 44 in the row for that specific URL in the first row of the picture), etc., html, etc. (with all the values inside the rows of Tag_Count handled the same way as for Element_Count).
The current code to generate the dataframe is the following:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore") #to ignore some warnings which have no effect in this particular case.
df = pd.read_csv('test.csv', sep=';')
df.head()
I have searched Google, and also on here, for answers, to no avail.
I have tried changing the test CSV to achieve my goal, with no success. After seeing a question on here, I have also tried using:
pd.DataFrame(df.ranges.str.split(',').tolist())
to achieve the desired result with no success.
Any ideas on how to achieve this via dataframes, or by any other method?
(If I have forgotten to mention anything you feel is important to understanding the problem, please say so and I will edit it in.)
Edit:
Although logic would say that the Element_Count and Tag_Count values should be in dictionary form and easily splittable, that is not the case, as shown in the print.
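If Element_Count and Tag_Count are stored as strings of the form "link: 44, html: 10, ..." (an assumption, since only the screenshot shows the exact format), one possible sketch is to parse each cell into a dict and expand it into columns:
import pandas as pd

df = pd.read_csv('test.csv', sep=';')

def parse_counts(cell):
    # hypothetical format: "link: 44, html: 10" -> {"link": 44, "html": 10}
    if pd.isna(cell):
        return {}
    return {key.strip(): int(value)
            for key, value in (pair.split(':') for pair in str(cell).split(','))}

element_cols = df['Element_Count'].apply(parse_counts).apply(pd.Series)
tag_cols = df['Tag_Count'].apply(parse_counts).apply(pd.Series)
result = pd.concat([df.drop(columns=['Element_Count', 'Tag_Count']),
                    element_cols, tag_cols], axis=1)
The parse_counts helper is made up for illustration and would need to be adapted to however the cells are actually formatted.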

Save Pandas dataframe with numeric column as text in Excel

I am trying to export a pandas dataframe to Excel where all columns are of text format. By default, the DataFrame.to_excel() function lets Excel decide the data type. Exporting a column with [1,2,'w'] results in the cells containing 1 and 2 being numeric and the cell containing 'w' being text. I'd like all rows in the column to be text (i.e. ['1','2','w']).
I was able to solve the problem by converting the column I need to be text using .astype(str). However, if the data is large, I am concerned that I will run into performance issues. If I understand correctly, df[col] = df[col].astype(str) makes a copy of the data, which is not efficient.
import pandas as pd
df = pd.DataFrame({'a':[1,2,'w'], 'b':['x','y','z']})
df['a'] = df['a'].astype(str)
df.to_excel(r'c:\tmp\test.xlsx')
Is there a more efficient way to do this?
I searched SO several times and didn't see anything on this. Forgive me if this has been answered before. This is my first post, and I'm really happy to participate in this cool forum.
Edit: Thanks to the comments I've received, I see that Converting a series of ints to strings - Why is apply much faster than astype? gives me alternatives to astype(str). This is really useful. I also wanted to know whether astype(str) was inefficient because it made a copy of the data, which I now see it does not.
I don't think you'll have performance issues with that approach, since the data is not copied but replaced. You may also convert the whole dataframe into string type using
df = df.astype(str)
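If astype(str) ever does become a bottleneck, the question linked in the edit above suggests element-wise conversion as a faster alternative; a small sketch with the same example data:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 'w'], 'b': ['x', 'y', 'z']})
# map(str) converts element-wise and is often faster than astype(str) on object columns
df['a'] = df['a'].map(str)
df.to_excel(r'c:\tmp\test.xlsx')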

Merging and cleaning up csv files in Python

I have been using pandas but am open to all suggestions. I'm not an expert at scripting and am at a complete loss. My goal is the following:
Merge multiple CSV files. I was able to do this in pandas and have a dataframe with the merged dataset.
Screenshot of how the merged dataset looks
Delete the duplicated "GEO" columns after the first set. This part doesn't let me use df = df.loc[:,~df.columns.duplicated()] because they are not technically duplicates: the repeated column names end with .1, .2, etc., which I am guessing the concat adds. Another problem is that some columns have a duplicated column name but hold different data. I have been using the first row as the index since it always contains the same coded values, but this row is unnecessary and will be deleted later in the script. This is my biggest problem right now.
Delete certain columns, such as the ones with the "Margins". I use ~df2.columns.str.startswith for this and have no trouble with it.
Replace spaces, ":" and ";" with underscores in the first row. I have no clue how to do this.
Insert a new column, write the '=TEXT(B1,0)' formula, and do this for the whole column (the formula changing to B2, B3, etc.), then copy the column and paste as values. I was able to do this in openpyxl, although I was having trouble and could not check the final output because of Excel problems.
source = excel.Workbooks.Open(filename)
excel.Range("C1:C1337").Select()
excel.Selection.Copy()
excel.Selection.PasteSpecial(Paste=constants.xlPasteValues)
Not sure if it works, and I was wondering whether this is possible in pandas or win32com, or whether I should stay with openpyxl. Thanks all!
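For what it's worth, steps 2-4 look doable in pandas alone. A rough sketch (the file names, the concat along columns, and the assumption that the repeated GEO columns come back as GEO.1, GEO.2, ... are all guesses from the description):
import pandas as pd

df = pd.concat([pd.read_csv(f) for f in ['file1.csv', 'file2.csv']], axis=1)

# drop the repeated GEO columns that come back suffixed as GEO.1, GEO.2, ...
df = df.loc[:, ~df.columns.str.match(r'GEO\.\d+$')]

# drop the "Margin" columns
df = df.loc[:, ~df.columns.str.startswith('Margin')]

# replace spaces, ":" and ";" with underscores in the first (coded-values) row
df.iloc[0] = df.iloc[0].astype(str).str.replace(r'[ :;]', '_', regex=True)
The =TEXT(B1,0) step is probably still easiest in openpyxl or win32com, although converting that column with astype(str) before exporting may achieve a similar result.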

python pandas dataframe to make r(t-1).iloc[1:] = r(t).iloc[0:-1]

I want to make one column of a pandas data frame, from its second row to its last row, equal to another column's first row through second-to-last row.
It looks like:
r(t-1).iloc[1:] = r(t).iloc[0:-1]
(r(t) and r(t-1) are columns of the same data frame.)
The problem I ran into is that Python wouldn't get my idea of shifting the rows while copying the data:
IR['r(t-1)'].iloc[1:] = IR['r(t)'].iloc[0:-1]
/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py:190:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
Anyone know how to deal with it?
Welcome to SO.
How to solve your problem
It seems you need a shift in one of your columns, as the first two lines of your question indicate. You can do it with the shift() method for pandas dataframes. Then your answer could be:
df['new_row'] = df['old_row'].shift(1)
You can shift forward or backward with positive and negative shift values.
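Applied to your example (assuming IR is your dataframe and 'r(t)' / 'r(t-1)' are its column names), a minimal sketch would be:
import pandas as pd

# hypothetical reconstruction of the asker's dataframe
IR = pd.DataFrame({'r(t)': [0.10, 0.12, 0.11, 0.13]})

# each row of r(t-1) gets the previous row of r(t); the first row becomes NaN
IR['r(t-1)'] = IR['r(t)'].shift(1)
print(IR)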
What is the error you have run into
Briefly speaking, pandas provides its own recommended ways of reading from and writing to a data frame, and it warns when you write in a different way. The warning shows that you are not using a good way (in pandas' humble opinion!) of writing to the dataframe (chained indexing with iloc), as iloc is mostly used for accessing rows by index.
Any pandas expert here, please correct me if I am wrong.
