Actually, I am fetching data from a database; it comes back in the CSV format below:
DS,SID,SID_T,E_DATE,S_DATA
TECH,312,TID,2021-01-03,"{""idx"":""msc"",""cid"":""f323d3"",""iname"":""master_in_science"",""mcap"":21.33,""sg"":[{""upt"":true,""dwt"":true,""high_low"":false}]}"
TECH,343,TID,2021-01-03,"{""idx"":""bsc"",""cid"":""k33d3"",""iname"":""bachelor_in_science"",""mcap"":81.33,""sg"":[{""upt"":false,""dwt"":true,""high_low"":false}]}"
TECH,554,TID,2021-01-03,"{""idx"":""ba"",""cid"":""3d3f32"",""iname"":""bachelor_in_art"",""mcap"":67.83,""sg"":[{""upt"":true,""dwt"":false,""high_low"":false}]}"
TECH,323,TID,2021-01-03,"{""idx"":""ma"",""cid"":""m23k66"",""iname"":""master_in_art"",""mcap"":97.13,""sg"":[{""upt"":true,""dwt"":true,""high_low"":true}]}"
The dataframe looks like the CSV above. I want to split the S_DATA column into multiple columns, one per key in the JSON.
What I have tried: I converted the dataframe to JSON and tried to normalize it with pandas.json_normalize, but I was unable to do so. The nested S_DATA.sg values, i.e. "sg":[{"upt":true,"dwt":true,"high_low":false}], also cause trouble during the conversion.
Try this:
import pandas as pd
import json
df = pd.read_csv('data.csv')
# Parse each JSON string into a dict.
df['S_DATA'] = df['S_DATA'].apply(json.loads)
# Flatten the dicts into columns and rejoin them with the remaining columns.
pd.concat([df[df.columns.difference(['S_DATA'])], pd.json_normalize(df['S_DATA'])], axis=1)
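If the nested sg list should become columns too, json_normalize can unpack it via record_path and meta. A minimal sketch, assuming each sg list holds exactly one dict, as in the sample rows:
import pandas as pd
import json
df = pd.read_csv('data.csv')
parsed = df['S_DATA'].apply(json.loads)
# record_path unpacks the 'sg' list; meta keeps the top-level keys alongside it.
sdata = pd.json_normalize(parsed.tolist(), record_path='sg',
                          meta=['idx', 'cid', 'iname', 'mcap'],
                          record_prefix='sg.')
out = pd.concat([df.drop(columns=['S_DATA']), sdata], axis=1)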
I am trying to create a list from a CSV. The CSV contains a two-dimensional table [540 rows and 8 columns], and I would like to create a list that contains the values of a specific column, column 4.
I tried list(df.columns.values)[4]; it gives me the name of the column, but I'm trying to get the values from the rows in column 4 and make them a list.
import pandas as pd
import urllib
#This is the empty list
company_name = []
#Uploading CSV file
df = pd.read_csv(r'Downloads\Dropped_Companies.csv')  # raw string avoids backslash-escape issues
#Extracting list of all companies name from column "Name of Stock"
companies_column=list(df.columns.values)[4] #This returns the name of the column.
companies_column = list(df.iloc[:,4].values)
So for this you can just add the following line after the code you've posted (using companies_column as the column name returned by list(df.columns.values)[4]):
company_name = df[companies_column].tolist()
This will get the column data in the companies column as a pandas Series (essentially, a Series is just a fancy list) and then convert it to a regular Python list.
Or, if you were to start from scratch, you can also just use these two lines
import pandas as pd
df = pd.read_csv(r'Downloads\Dropped_Companies.csv')
company_name = df[df.columns[4]].tolist()
Another option: if this is the only thing you need to do with your CSV file, you can get away with just the csv library that comes with Python instead of installing pandas, using this approach.
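For instance, a minimal csv-module sketch, assuming the same file and a header row:
import csv
with open(r'Downloads\Dropped_Companies.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    company_name = [row[4] for row in reader]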
If you want to learn more about how to get data out of your pandas DataFrame (the df variable in your code), you might find this blog post helpful.
I think you can try this to get all the values of a specific column:
companies_column = df["column name"]
Replace "column name" with the name of the column whose values you want.
I am new to Python and I am trying to read a CSV file using pandas, but I have a bit of a problem within my CSV file. I have strings which contain commas at the end, and this creates an undesired column towards the end, as shown:
This is the raw csv:
For example, on line 14, the green string value ends with a comma and creates a new column, which then gives me parsing errors when using this:
import pandas as pd
pd.read_csv("data.csv")
ParserError: Error tokenizing data. C error: Expected 6 fields in line 8, saw 7
Is there a way I can clean up this and merge the last two columns?
You can use np.where to fill APP from the last column where APP is missing, then drop the last column.
import pandas as pd
import numpy as np
df = pd.read_csv("data.csv")
# Where APP is NaN, take the value that spilled into the extra last column.
df['APP'] = np.where(df['APP'].isna(), df.iloc[:, -1], df['APP'])
# Drop the now-redundant last column.
df = df.iloc[:, :-1]
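Note that if read_csv itself raises the ParserError before you can patch the frame, one workaround is to name the extra seventh column up front so every line parses. A sketch with hypothetical column names, since the real header isn't shown:
import pandas as pd
import numpy as np
# 'C1'..'C5' and 'extra' are placeholder names for illustration only.
cols = ['C1', 'C2', 'C3', 'C4', 'C5', 'APP', 'extra']
df = pd.read_csv('data.csv', names=cols, header=0)
df['APP'] = np.where(df['APP'].isna(), df['extra'], df['APP'])
df = df.drop(columns='extra')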
Sorry, this might be a very simple question, but I am new to Python/JSON and everything. I am trying to filter my Twitter JSON data set based on user_location/country_code/gb, but I have no idea how to do this. I have tried several ways, but still no luck. I have attached my data set and some of the code I have used here. I would appreciate any help.
Here is what I did to get the best result; however, I do not know how to run it over the whole data set and print the resulting tweet_id values:
import json
import pandas as pd
df = pd.read_json('example.json', lines=True)
if df['user_location'][4]['country_code'] == 'th':
    print(df.tweet_id[4])
else:
    print('false')
This code shows me the tweet_id: 1223489829817577472.
However, I couldn't extend it to the whole data set.
I have tried this code as well, still no luck:
dataset = df[df['user_location'].isin([ "gb" ])].copy()
print (dataset)
This is what my data set looks like:
I would break the user_location column into multiple columns using the following
df = pd.concat([df, df.pop('user_location').apply(pd.Series)], axis=1)
Running this should give you a column each for the keys contained within the user_location json. Then it should be easy to print out tweet_ids based on country_code using:
df[df['country_code']=='th']['tweet_id']
An explanation of what is actually happening here:
df.pop('user_location') removes the 'user_location' column from df and returns it at the same time
With the returned column, we use the .apply method to apply a function to the column
pd.Series converts each JSON dictionary into a Series; applied across the column, this produces a DataFrame with one column per key
pd.concat concatenates the original df (now without the 'user_location' column) with the new columns created from the 'user_location' data
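An equivalent route via pd.json_normalize, a sketch under the same assumption that every row's user_location is a dict:
import pandas as pd
df = pd.read_json('example.json', lines=True)
# Flatten the dicts into their own columns, then filter.
loc = pd.json_normalize(df.pop('user_location').tolist())
df = pd.concat([df, loc], axis=1)
print(df.loc[df['country_code'] == 'gb', 'tweet_id'])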
I am trying to create a dataframe where the column lengths are not equal. How can I do this?
I was trying to use groupby, but I don't think this is the right way.
import pandas as pd
data = {'filename':['file1','file1'], 'variables':['a','b']}
df = pd.DataFrame(data)
grouped = df.groupby('filename')
print(grouped.get_group('file1'))
Above is my sample code, the output of which is:
  filename variables
0    file1         a
1    file1         b
What can I do to just have one entry of 'file1' under 'filename'?
Eventually I need to write this to a csv file.
Thank you
If you only have one entry in a column, the other will be NaN, so you could just filter out the NaNs with something like df = df[df["filename"].notnull()].
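If the goal is just to show 'file1' once in the written file, another option is to blank out the repeated filenames right before saving. A sketch ('out.csv' is an illustrative path):
import pandas as pd
data = {'filename': ['file1', 'file1'], 'variables': ['a', 'b']}
df = pd.DataFrame(data)
# Keep the first occurrence, blank the repeats, then write out.
df.loc[df['filename'].duplicated(), 'filename'] = ''
df.to_csv('out.csv', index=False)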
I am trying to import a file from xlsx into a Python pandas dataframe. I would like to prevent fields/columns from being interpreted as integers and thus losing leading zeros or other desired heterogeneous formatting.
So for an Excel sheet with 100 columns, I would do the following, using a dict comprehension with range(100).
import pandas as pd
filename = r'C:\DemoFile.xlsx'
fields = {col: str for col in range(100)}
df = pd.read_excel(filename, sheet_name=0, converters=fields)
These import files have a varying number of columns all the time, and I am looking to handle this without changing the range manually each time.
Does somebody have any further suggestions or alternatives for reading Excel files into a dataframe and treating all fields as strings by default?
Many thanks!
Try this:
xl = pd.ExcelFile(r'C:\DemoFile.xlsx')
ncols = xl.book.sheet_by_index(0).ncols
df = xl.parse(0, converters={i : str for i in range(ncols)})
UPDATE:
In [261]: type(xl)
Out[261]: pandas.io.excel.ExcelFile
In [262]: type(xl.book)
Out[262]: xlrd.book.Book
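Note that xl.book is an xlrd Book only when the xlrd engine is in use; newer pandas reads .xlsx with openpyxl, where the same probe could look like this (a sketch, not verified across versions):
xl = pd.ExcelFile(r'C:\DemoFile.xlsx')
ncols = xl.book.worksheets[0].max_column  # openpyxl Workbook API
df = xl.parse(0, converters={i: str for i in range(ncols)})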
Use dtype=str when calling .read_excel()
import pandas as pd
filename = r'C:\DemoFile.xlsx'
df = pd.read_excel(filename, dtype=str)
The usual solution is:
1. read in one row of data just to get the column names and the number of columns
2. create the dictionary automatically, where each column has a string type
3. re-read the full data using the dictionary created at step 2.
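A sketch of those three steps, assuming the same file as above:
import pandas as pd
filename = r'C:\DemoFile.xlsx'
# Step 1: read only the header row to learn the column names.
header = pd.read_excel(filename, nrows=0)
# Step 2: map every column name to str.
fields = {col: str for col in header.columns}
# Step 3: re-read the full sheet with that converters dict.
df = pd.read_excel(filename, converters=fields)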