Exporting several scraped tables into a single CSV File - python

How can I concatenate the tables read from several HTML pages? I understand they come back as lists, so how can I write more than one table, each scraped from a different URL, into one single CSV? Any ideas? Is it possible to save the print output in a variable and then move it into a CSV?
import pandas as pd
df = pd.read_html('URL')
df1 = pd.read_html('URL')
print(df, df1)
(df, df1).to_csv('name.csv')
The expression (df, df1) is of course incorrect; I just wrote it to describe what I am missing.
Thank you very much in advance

pd.read_html returns a list of dataframes. So, if you are sure the lists contain dataframes formatted in a way that can be concatenated, you can consolidate them into a single dataframe and then export it to CSV:
import pandas as pd

dframes_list1 = pd.read_html('URL1')  # each call returns a list of DataFrames
dframes_list2 = pd.read_html('URL2')
dframes_all = dframes_list1 + dframes_list2  # plain list concatenation
consolidated_dframe = pd.concat(dframes_all)  # stack everything into one DataFrame
consolidated_dframe.to_csv('name.csv')
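If each URL yields more than one table and you only want a specific one, you can index into the returned lists before concatenating. A sketch, assuming the table you want is the first on each page (the index 0 is an assumption; inspect the lists to find yours):
import pandas as pd

# read_html returns a list of DataFrames; [0] picks the first table on each page
tables = [pd.read_html('URL1')[0], pd.read_html('URL2')[0]]
pd.concat(tables, ignore_index=True).to_csv('name.csv', index=False)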

Related

Creating list from imported CSV file with pandas

I am trying to create a list from a CSV. This CSV contains a 2-dimensional table [540 rows and 8 columns] and I would like to create a list that contains the values of a specific column, column 4 to be specific.
I tried list(df.columns.values)[4]; it does return the name of the column, but I'm trying to get the values from the rows in column 4 and make them a list.
import pandas as pd

#This is the empty list
company_name = []
#Loading the CSV file (raw string so the backslash is not treated as an escape)
df = pd.read_csv(r'Downloads\Dropped_Companies.csv')
#Extracting the list of all company names from the column "Name of Stock"
companies_column = list(df.columns.values)[4]  #This returns the name of the column.
companies_column = list(df.iloc[:, 4].values)
So for this you can just add the following line after the code you've posted (note that it uses companies_column as the column name, i.e. the value from your first assignment, not the list of values it was overwritten with):
company_name = df[companies_column].tolist()
This will get the column data in the companies column as a pandas Series (essentially, a Series is just a fancy list) and then convert it to a regular Python list.
Or, if you were to start from scratch, you can also just use these lines:
import pandas as pd
df = pd.read_csv(r'Downloads\Dropped_Companies.csv')
company_name = df[df.columns[4]].tolist()
Another option: if this is the only thing you need to do with your CSV file, you can also get away with just using the csv library that comes with Python instead of installing pandas, using this approach.
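A minimal sketch of that approach with the standard library's csv module (assuming the same file and that you want the fifth column, index 4):
import csv

company_name = []
with open('Downloads/Dropped_Companies.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        company_name.append(row[4])  # fifth column, same as df.iloc[:, 4]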
If you want to learn more about how to get data out of your pandas DataFrame (the df variable in your code), you might find this blog post helpful.
I think that you can try this for getting all the values of a specific column:
companies_column = df["column name"]
Replace "column name" with the name of the column whose values you want to access.

How to add and delete columns in a JSON file, then save it as a CSV

I have tried to use a dataframe to add columns and values to the JSON file, but it seems like after I tried to delete some columns it returned to the original data file. I also faced the problem of not being able to save it into a CSV file. So I was wondering, maybe I cannot use a dataframe for this?
It is like a list divided into different columns (around 30 rows in total). Some of them I would like to delete, like the route and URL columns, while adding three columns with length, maxcal, mincal (all the values in these 3 columns are found in the route column).
This is what I have done so far, and where I got stuck:
import pandas as pd
import json

data = pd.read_json('fitness.json')   # fitness.json is the filename of the json file
fitness2 = pd.DataFrame(fitness2)     # fitness2 is meant to hold the three new columns
fitness2
data.join(fitness2, lsuffix="_left")  # to join the three columns into the data table
I am not sure how I can delete the columns 'route', 'MapURL', 'MapURL_tc', 'MapURL_sc' and then finally save it as a CSV like the output shown.
Thank you.
You can drop the columns and then concat the two dataframes, making sure to assign the result (pd.concat returns a new frame):
data.drop(['MapURL', 'MapURL_tc', 'MapURL_sc'], inplace=True, axis=1)
data = pd.concat([data, fitness2], axis=1)  # join the three columns into the data table
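Putting it together, a minimal end-to-end sketch that also saves the CSV. The way length, maxcal and mincal are pulled out of the route column is an assumption here; adapt the extraction to your actual JSON structure:
import pandas as pd

data = pd.read_json('fitness.json')

# Hypothetical extraction: assumes each entry in 'route' is a dict
# carrying these three values. Adjust the keys to your data.
fitness2 = pd.DataFrame({
    'length': data['route'].apply(lambda r: r.get('length')),
    'maxcal': data['route'].apply(lambda r: r.get('maxcal')),
    'mincal': data['route'].apply(lambda r: r.get('mincal')),
})

data = data.drop(['route', 'MapURL', 'MapURL_tc', 'MapURL_sc'], axis=1)
data = pd.concat([data, fitness2], axis=1)
data.to_csv('fitness.csv', index=False)  # save the result as CSV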

Normalize column with JSON data in Pandas dataframe

I have a Pandas dataframe in which one column contains JSON data (the JSON structure is simple: only one level, there is no nested data):
ID,Date,attributes
9001,2020-07-01T00:00:06Z,"{"State":"FL","Source":"Android","Request":"0.001"}"
9002,2020-07-01T00:00:33Z,"{"State":"NY","Source":"Android","Request":"0.001"}"
9003,2020-07-01T00:07:19Z,"{"State":"FL","Source":"ios","Request":"0.001"}"
9004,2020-07-01T00:11:30Z,"{"State":"NY","Source":"windows","Request":"0.001"}"
9005,2020-07-01T00:15:23Z,"{"State":"FL","Source":"ios","Request":"0.001"}"
I would like to normalize the JSON content in the attributes column so that each JSON attribute becomes a column in the dataframe.
ID,Date,attributes.State, attributes.Source, attributes.Request
9001,2020-07-01T00:00:06Z,FL,Android,0.001
9002,2020-07-01T00:00:33Z,NY,Android,0.001
9003,2020-07-01T00:07:19Z,FL,ios,0.001
9004,2020-07-01T00:11:30Z,NY,windows,0.001
9005,2020-07-01T00:15:23Z,FL,ios,0.001
I have been trying to use Pandas json_normalize, which requires a dictionary. So I figured I would convert the attributes column to a dictionary, but it does not quite work out as expected, for the dictionary has the form:
df.attributes.to_dict()
{0: '{"State":"FL","Source":"Android","Request":"0.001"}',
1: '{"State":"NY","Source":"Android","Request":"0.001"}',
2: '{"State":"FL","Source":"ios","Request":"0.001"}',
3: '{"State":"NY","Source":"windows","Request":"0.001"}',
4: '{"State":"FL","Source":"ios","Request":"0.001"}'}
And the normalization takes the keys (0, 1, 2, ...) as the column names instead of the JSON keys.
I have the feeling that I am close but I can't quite work out how to do this exactly. Any idea is welcome.
Thank you!
Normalize expects to work on an object, not a string.
import json
import pandas as pd

# Parse each JSON string into a dict first, then normalize
df_final = pd.json_normalize(df.attributes.apply(json.loads))
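To keep the other columns (ID, Date) alongside the normalized attributes, you can join the result back onto the original frame, a sketch assuming df has a default integer index, as after read_csv:
df_final = df.drop(columns=['attributes']).join(
    pd.json_normalize(df['attributes'].apply(json.loads))
)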
You shouldn’t need to convert to a dictionary first.
Try:
import pandas as pd
pd.json_normalize(df['attributes'])
I found a solution, but I am not overly happy with it. I reckon it is very inefficient.
import pandas as pd
import json

# Import full dataframe
df = pd.read_csv(r'D:/tmp/sample_simple.csv', parse_dates=['Date'])

# Create empty dataframe to hold the results of data conversion
df_attributes = pd.DataFrame()

# Loop through the data to fill the dataframe
for index in df.index:
    row_json = json.loads(df.attributes[index])
    normalized_row = pd.json_normalize(row_json)
    # df_attributes = df_attributes.append(normalized_row) (deprecated method) use concat instead
    df_attributes = pd.concat([df_attributes, normalized_row], ignore_index=True)

# Reset the index of the attributes dataframe
df_attributes = df_attributes.reset_index(drop=True)

# Drop the original attributes column
df = df.drop(columns=['attributes'])

# Join the results
df_final = df.join(df_attributes)

# Show results
print(df_final)
print(df_final.info())
Which gives me the expected result. However, as I said, there are several inefficiencies in it. For starters, the repeated concat inside the for loop. According to the documentation, the best practice is to build a list and concatenate once, but I could not figure out how to do that while keeping the shape I wanted. I welcome all critiques and ideas.
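For reference, the list-then-concat pattern the documentation recommends would look roughly like this (a sketch that produces the same shape as the loop above):
import json
import pandas as pd

df = pd.read_csv(r'D:/tmp/sample_simple.csv', parse_dates=['Date'])

# Collect the normalized one-row frames in a plain list, concatenate once at the end
normalized_rows = [pd.json_normalize(json.loads(s)) for s in df['attributes']]
df_attributes = pd.concat(normalized_rows, ignore_index=True)

df_final = df.drop(columns=['attributes']).join(df_attributes)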

How can I extract specific rows which contain a specific keyword from my JSON dataset using pandas in Python?

Sorry, this might be a very simple question, but I am new to Python/JSON and everything. I am trying to filter my Twitter JSON data set based on user_location/country_code/gb, but I have no idea how to do this. I have tried several ways but still no luck. I have attached my data set and some of the code I have used here. I would appreciate any help.
Here is what I did to get the best result so far; however, I do not know how to tell it to go over the whole data set and print out the resulting tweet_id:
import json
import pandas as pd

df = pd.read_json('example.json', lines=True)
if df['user_location'][4]['country_code'] == 'th':
    print(df.tweet_id[4])
else:
    print('false')
this code shows me the tweet_id: 1223489829817577472
however, I couldn't extend it to the whole data set.
I have tried this code as well, still no luck:
dataset = df[df['user_location'].isin([ "gb" ])].copy()
print (dataset)
This is what my data set looks like (screenshot not reproduced here).
I would break the user_location column into multiple columns using the following:
df = pd.concat([df, df.pop('user_location').apply(pd.Series)], axis=1)
Running this should give you a column for each of the keys contained within the user_location JSON. Then it should be easy to print out tweet_ids based on country_code using:
df[df['country_code']=='th']['tweet_id']
An explanation of what is actually happening here:
df.pop('user_location') removes the 'user_location' column from df and returns it at the same time
With the returned column, we use the .apply method to apply a function to each element of the column
pd.Series converts each JSON object/dictionary into a Series; applied element-wise, this yields a DataFrame with one column per key
pd.concat concatenates the original df (now without the 'user_location' column) with the new columns created from the 'user_location' data
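For reference, an equivalent approach (a sketch, assuming user_location already holds dicts rather than JSON strings and df has a default integer index) uses pd.json_normalize:
import pandas as pd

# Expand the dicts into one column per key, then attach them to df
df = df.drop(columns=['user_location']).join(pd.json_normalize(df['user_location']))
print(df[df['country_code'] == 'th']['tweet_id'])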

Python2.7: How to split a column into multiple columns based on special strings like this?

I'm a newbie for programming and python, so I would appreciate your advice!
I have a dataframe like this.
In the 'info' column, there are 7 different categories: activities, locations, groups, skills, sights, types and other, and each category has unique values within [ ] (i.e., "activities":["Tour"]).
I would like to split the 'info' column into 7 different columns based on each category, as shown below.
I would like to allocate appropriate column names and also put corresponding unique strings within [ ] to each row.
Is there any easy way to split dataframe like that?
I was thinking of using str.split to split it into pieces and merge everything later, but I'm not sure that is the best way to go, and I wanted to see if there is a more sophisticated way to make a dataframe like this.
Any advice is appreciated!
--UPDATE--
When I print(dframe['info']), it shows like this (screenshot not reproduced here).
It looks like the content of the info column is JSON-formatted, so you can parse that into a dict object easily:
>>> import json
>>> s = '''{"activities": ["Tour"], "locations": ["Tokyo"], "groups": []}'''
>>> j = json.loads(s)
>>> j
{u'activities': [u'Tour'], u'locations': [u'Tokyo'], u'groups': []}
Once you have the data as a dict, you can do whatever you like with it.
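To apply that idea to every row of the column at once, a minimal sketch (assuming the dataframe is called dframe, as in the update above):
import json
import pandas as pd

# Parse each JSON string, then expand each dict into a row of columns
df_info = dframe['info'].apply(json.loads).apply(pd.Series)
dframe = pd.concat([dframe.drop('info', axis=1), df_info], axis=1)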
OK, here is how to do it:
import pandas as pd
import ast

#Initial Dataframe is df
mylist = list(df['info'])
mynewlist = []
for l in mylist:
    mynewlist.append(ast.literal_eval(l))
df_info = pd.DataFrame(mynewlist)

#Add columns of decoded info to the initial dataset
df_new = pd.concat([df, df_info], axis=1)

#Remove the column info
del df_new['info']
You can use the json library to do that.
1) Import the json library:
import json
2) Turn all the rows of that column into strings, then apply the json.loads function to all of them and store the result in an object:
jsonO = df['info'].map(str).apply(json.loads)
3) The jsonO object is now a pandas Series of parsed dicts that you can navigate. For each key of the parsed JSON, you can create a column in your final dataframe:
df['activities'] = jsonO.apply(lambda x: x['activities'])
Here, the value of one key from each parsed row is dumped into the corresponding new column of your final dataframe df.
4) Repeat step 3 for all the columns you're interested in (see the sketch below).
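To avoid repeating step 3 by hand for all seven categories, a short sketch that loops over the keys of the first parsed row (assuming every row has the same keys):
for key in jsonO.iloc[0].keys():
    # Bind key as a default argument so each lambda sees its own key
    df[key] = jsonO.apply(lambda x, k=key: x.get(k))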
