How to get column name in tab separated dataframe? - python

I have a tab-separated file which I loaded into a pandas DataFrame as below:
import pandas as pd
data1 = pd.DataFrame.from_csv(r"C:\Users\Ashish\Documents\indeed_ml_dataset\train.tsv", sep="\t")
data1
Here is what data1 looks like:
Now, I want to view the column named tags. I'm not sure whether it counts as a column, but I tried accessing it the usual way:
data2=data1[['tags']]
but it raises an error. I have tried several other things as well using index and loc, but all of them fail. Any suggestions?

To fix this you'll need to remove description from the index by resetting. Try the below:
data2 = data1.reset_index()
data2['tags']
You'll then be able to select by "tags".

Try reading your data with pd.read_csv instead of pd.DataFrame.from_csv, because from_csv uses the first column as the index by default. (from_csv was also deprecated and has since been removed from pandas.)
For more info refer to this documentation on pandas website: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_csv.html
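As a minimal sketch of the difference, the snippet below reads the same tab-separated text twice. The sample text and the column names description and tags are made up for illustration; passing index_col=0 to read_csv mimics the default behaviour of from_csv described above.

```python
import io

import pandas as pd

# Hypothetical stand-in for train.tsv: two tab-separated columns.
tsv_text = "description\ttags\nfirst job\tpart-time\nsecond job\tsalary\n"

# index_col=0 turns the first column into the index,
# which is what from_csv did by default.
as_index = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0)

# Plain read_csv keeps every column as a regular column.
as_column = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(as_index.columns.tolist())
print(as_column.columns.tolist())
print(as_column["tags"])
```

With the first column kept as a regular column, selecting by name such as as_column["tags"] or as_column[["tags"]] works in the usual way.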

Related

List index out of range when specifying column in numpy

I have been tasked with extracting data from a specific column of a csv file using numpy and loadtxt. This data is in column D of the attached image. By my logic I should use the numpy parameter usecols=3 to obtain only the 4th column, which is the one I want. But my output keeps telling me that the index is out of range when there is clearly a column there. I have done some prior searching, and the general consensus seems to be that one of the rows doesn't have any data in the column. But I have checked, and all the rows have data in the column. Here is the code I'm using. Can anyone tell me why this is happening?
from numpy import loadtxt
data = open("suttonboningtondata_moodle.csv","r")
min_temp = loadtxt(data,usecols=(3),skiprows=5,dtype=str,delimiter=" ")
print(min_temp)
I would suggest using another library to extract your data. The pandas library works well in this regard.
Here is a documentation link to guide you.
pandas docs
I used a comma instead of whitespace as the delimiter value and it worked, though I'm not sure why.
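A likely explanation is that the file is comma-separated: splitting each line on a space then yields fewer than four fields, so usecols=3 points past the end of the row. The sketch below shows the comma version on made-up data; the sample text, its single header line, and the column names are assumptions, not the contents of the real file.

```python
import io

import numpy as np

# Hypothetical stand-in for the CSV file: one header line, then comma-separated rows.
csv_text = "year,month,max_temp,min_temp\n2020,1,8.1,2.3\n2020,2,9.4,3.0\n"

# delimiter="," splits each row into four fields, so column index 3 exists.
min_temp = np.loadtxt(
    io.StringIO(csv_text),
    usecols=3,
    skiprows=1,
    dtype=str,
    delimiter=",",
)
print(min_temp)
```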

How can I read a document with pandas (python) that doesn't look like the average one?

I am trying to get the values from the columns of a file. The doc looks like this
the data I want to read
All the examples I have found use pd.read_csv or pd.DataFrame, but the data usually has a clear header and nothing on top of it (I have about 10 lines that I don't really need for what I am doing).
Also, I think maybe there is something wrong because I tried to run:
data = pd.read_csv('tdump_BIL_100_-96_17010100.txt',header=15)
and I get
pd.read_csv output
which is just the row in one column, so there is no separation apparently, and therefore no way of getting the columns I need.
So my question is if there is a way to get the data from this file with pandas and how to get it.
If the number of leading lines is fixed, skip those rows, indicate that no header is present, and specify that values are separated by whitespace.
df = pd.read_csv(filename, skiprows=15, header=None, sep=r'\s+')
See read_csv for documentation.
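A small self-contained sketch of the same call, using made-up text in place of the real file. The three preamble lines and the values are assumptions for illustration only; the real file would need its own skiprows count.

```python
import io

import pandas as pd

# Hypothetical stand-in for the file: three preamble lines,
# then whitespace-separated values with no header row.
raw_text = (
    "preamble line 1\n"
    "preamble line 2\n"
    "preamble line 3\n"
    "1   10.5   -96.0\n"
    "2   11.0   -95.5\n"
)

# Skip the preamble, treat every remaining line as data,
# and split on any run of whitespace.
df = pd.read_csv(io.StringIO(raw_text), skiprows=3, header=None, sep=r"\s+")
print(df)
```

Because header=None is used, the columns are labelled 0, 1, 2 and can be selected as df[1], or renamed by passing a names list to read_csv.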

How to transform values from dataframe(python) attributes from a row into columns?

I have the following dataframe, loaded from a csv, that I want to use for some sampling tests.
For that I want to keep all of the current columns, but split Element_Count and Tag_Count into separate columns, one for each value they contain (e.g. link: 10 in Element_Count, and likewise for Tag_Count).
I want to extract each value and turn it into a column. The final dataframe would be something like this (obviously depending on the values inside Element_Count and Tag_Count):
Index (the 0, 1, 2, etc. from the dataframe itself), PageID, Uri, A, AA, AAA, then one column per entry in Element_Count (e.g. a link column, holding 44 in the row for the first URL in the picture), then one column per entry in Tag_Count (e.g. html), covering every entry that appears in any row of those two columns.
The current code to generate the dataframe is the following:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore") #to ignore some warnings which have no effect in this particular case.
df = pd.read_csv('test.csv', sep=';')
df.head()
I have searched Google and this site for answers, to no avail.
I have tried changing the test csv to achieve my goal, with no success. After seeing a question on here, I also tried:
pd.DataFrame(df.ranges.str.split(',').tolist())
to achieve the desired result, with no success.
Any ideas on how to achieve this with dataframes, or by any other method?
(If I have forgotten to mention anything that you feel is important for understanding the problem, please say so and I will edit it in.)
Edit:
Although logic would say that the element and tag count arrays should be in dictionary form and easy to split, that is not the case, as shown in the printed output.

How to populate DataFrame column from JSON string element in another column

I have a DataFrame with a "details" column that I believe is a dictionary. The initial data is a JSON string parsed with json.loads, then converted from a dictionary to DataFrame. I would like to populate a new "affectedDealId" column with the value in data['details']['actions']['affectedDealId'].
I'm hoping I can do this the "DataFrame" way without using a loop with something like:
data['affectedDealId'] = data['details'].get('actions').get('affectedDealId')
To simplify I've tried:
data['actions'] = data['details'].get('actions')
But that ends up as "None".
Also data['details'] seems to be a series when I think it's a dictionary before converting it to a DataFrame.
Alternatively, I do later loop through the DataFrame. How would I access that 'affectedDealId' element?
Below is a screenshot of the DataFrame from the PyCharm debugger.
I'm making some assumptions about the details JSON, but does this help? You'll have to adjust the key/index applied to json.loads(x) to reach the right location.
df['affectedDealId'] = df['details'].apply(lambda x: json.loads(x)['affectedDealId'])
I think it would be great if you could do something like this.
This creates a data frame from your JSON column by applying pd.Series to it:
data_frame_new = df['details'].apply(pd.Series)
and then reassign your data frame by concatenating data_frame_new with your existing data frame.
df = pd.concat([df,data_frame_new],axis = 1)
print(df)
This approach worked for me on a recent project.
Your affectedDealId will become a column of its own with the data populated.
It may be of help to you.
Thanks
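For the case described in the question, where each cell of details already holds a parsed dictionary, a minimal sketch is to apply a small function to each cell. Note that Series.get looks up a label in the Series index rather than a key inside each dictionary, which is one plausible reason data['details'].get('actions') came back as None. The sample rows, the helper name get_affected_deal_id, and the assumption that actions is a dictionary (not a list) are all hypothetical.

```python
import pandas as pd

# Hypothetical data: each "details" cell is an already-parsed dict.
df = pd.DataFrame(
    {
        "id": [1, 2],
        "details": [
            {"actions": {"affectedDealId": 101}},
            {"actions": {}},  # a row where the key is missing
        ],
    }
)


def get_affected_deal_id(details):
    """Return details['actions']['affectedDealId'], or None if absent."""
    actions = details.get("actions") or {}
    return actions.get("affectedDealId")


# apply calls the function once per cell, so .get runs on each dict.
df["affectedDealId"] = df["details"].apply(get_affected_deal_id)
print(df[["id", "affectedDealId"]])
```

If actions turns out to be a list of dictionaries, the helper would need to index into the list first.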

How can I create index for python pandas dataframe?

I am importing several csv files into python using Jupyter notebook and pandas, and some are created without a proper index column. Instead, the first column, which is data that I need to manipulate, is used as the index. How can I create a regular index column as the first column? This seems like a trivial matter, but I can't find any useful help anywhere.
What my dataframe looks like
What my dataframe should look like
Could you please try this:
df.reset_index(inplace = True, drop = True)
Let me know if this works.
When you are reading in the csv, pass the index_col argument to pandas.read_csv, giving the number of the column to use as the index. If the files don't have a proper index column, set index_col=False.
To change the index of an existing DataFrame df, try df = df.reset_index() or df = df.set_index(column), where column is the name of the column to use.
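A short sketch of how the two forms of reset_index differ, on a made-up frame whose index holds data (the names key and value are hypothetical). Plain reset_index() moves the old index back into a regular column, while drop=True discards it, which matters if the index contains data you still need.

```python
import pandas as pd

# Hypothetical frame where the first column ended up as the index.
df = pd.DataFrame({"value": [10, 20]}, index=pd.Index(["a", "b"], name="key"))

# The old index becomes a regular column; a fresh 0..n-1 index is created.
kept = df.reset_index()

# The old index is thrown away entirely.
dropped = df.reset_index(drop=True)

print(kept)
print(dropped)
```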
When you imported your csv, did you use the index_col argument? It should default to None, according to the documentation. If you don't use the argument, you should be fine.
Either way, you can force it not to use a column by using index_col=False. From the docs:
Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.
Python 3.8.5
pandas==1.2.4
pd.read_csv('file.csv', header=None)
I found the solution in the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
