Copying columns data to a new sheet using a reference column - python

Here is a snip of my data for reference
Column A contains 9134 study IDs and column B contains 9467 study IDs. I have previously applied exclusions to column A (it was once column B before I deleted certain people due t exclusion) Columns C-G is the new data which corresponds to column B (there are 9467 rows of the data)
What I am looking to do is match the data to column A..for example "Study ID 2 #1195" I would like that data to line up with that Study ID 1 #1195 in column A
ALSO, because I intentionally deleted some people from column A, there are Study ID 2 #'s that have been deleted from Study ID 1..so I am not interested in those people
The ultimate goal is to line up the data with column A in order to copy and paste seamlessly into an existing SPSS database
I am not sure how to go about this. Any help would be greatly appreciated!!
HERE IS A LINK TO MY DATASET
https://drive.google.com/open?id=1lhbuthqNPLLi8KRVmCOEqmBkh5dy2C41

This seems like more of an excel question than any kind of coding question. I think all you want to do is keep records where the study id exists in column A.
This can be achieved in Excel y doing a vlookup on column b where you check if a record exists in column A.
You can then filter out any records that don't exist and copy that information to a new sheet.
Example:
Add new column
Apply filter:

Related

How to extract values based on column header in excel?

I have an excel file containing values, I needed values as the highlighted one in single column and deleting the rest on. But due to mismatch in rows and column header file, I am not able to extract. Once you will see the excel will able to understand what values I needed.As this is just a sample of mine data.
Column A2:A17 date is continuous but few date are repeating, but in Row (D1:K1) date are not repeating, so in this case value of same date occurring just below of of one other.
How to get values in one column?
Is there a way to highlight the values of same date occurring in row and column? The sample data consist of manually highlighted. I have huge dataset that cannot be manually highlighted.
Because from colour code also I can get the required values too.
Following is the file I am attaching here
https://docs.google.com/spreadsheets/d/1-xBMKRP1_toA_Ky8mKxCKAFi4uQ8YWJq/edit?usp=sharing&ouid=110042758694954349181&rtpof=true&sd=true
Please visit the link and help me to find the solution.
Thank you
I'm not clear what those values in columns D to K are.
If only the shaded ones matter and they can be derived from the Latitude and Longitude for each row separately:
Insert a column titled "Row", say in A, and populate it 1,2,3...
I think you also want a column E which is whatever the calculation you currently have in D-K. Is this "Distance"?
Then create a Pivot Table on rows A to E and you can do anything you are likely to need: https://support.microsoft.com/en-us/office/create-a-pivottable-to-analyze-worksheet-data-a9a84538-bfe9-40a9-a8e9-f99134456576
Dates at Colum Labels, Row numbers as Row Labels, and Sum of "Distance" as Values.

How could I create a column with matchin values from different datasets with different lengths

I want to create a new column in the dataset in which a ZipCode is assigned to a specific Region.
There are in total 5 Regions. Every Region consists of an x amount of ZipCodes. I would like to use the two different datasets to create a new column.
I tried some codes already, however, I failed because the series are not identically labeled. How should I tackle this problem?
I have two datasets, one of them has 1518 rows x 3 columns and the other one has
46603 rows x 3 columns.
As you can see in the picture:
df1 is the first dataset with the Postcode and Regio columns, which are the ZipCodes assigned to the corresponding Regio.
df2 is the second dataset where the Regio column is missing as you can see. I would like to add a new column into the df2 dataset which contains the corresponding Regio.
I hope someone could help me out.
Kind regards.
I believe you need to map the zipcode from dataframe 2 to the region column from the first dataframe. Assuming Postcode and ZipCode are same.
First create a dictionary from df1 and then replace the zipcode values based on the dictionary values
zip_dict = dict(zip(df1.Postcode, df1.Regio))
df2.ZipCode.replace(zip_dict)

Pandas convert data from two tables into third table. Cross Referencing and converting unique rows to columns

I have the following tables:
Table A
listData = {'id':[1,2,3],'date':['06-05-2021','07-05-2021','17-05-2021']}
pd.DataFrame(listData,columns=['id','date'])
Table B
detailData = {'code':['D123','F268','A291','D123','F268','A291'],'id':['1','1','1','2','2','2'],'stock':[5,5,2,10,11,8]}
pd.DataFrame(detailData,columns=['code','id','stock'])
OUTPUT TABLE
output = {'code':['D123','F268','A291'],'06-05-2021':[5,5,2],'07-05-2021':[10,11,8]}
pd.DataFrame(output,columns=['code','06-05-2021','07-05-2021'])
Note: The code provided is hard coded code for the output. I need to generate the output table from Table A and Table B
Here is brief explanation of how the output table is generated if it is not self explanatory.
The id column needs to be cross reference from Table A to Table B and the dates should be put instead in Table B
Then all the unique dates in Table B should be made into columns and the corresponding stock values need to be shifted to then newly created date columns.
I am not sure where to start to do this. I am new to pandas and have only ever used it for simple data manipulation. If anyone can suggest me where to get started, it will be of great help.
Try:
tableA['id'] = tableA['id'].astype(str)
tableB.merge(tableA, on='id').pivot('code', 'date', 'stock')
Output:
date 06-05-2021 07-05-2021
code
A291 2 8
D123 5 10
F268 5 11
Details:
First, merge on id, this is like doing a SQL join. First, the
dtypes much match, hence using astype to str.
Next, reshape the dataframe using pivot to get code by date.

Adding rows to another variable. The rows are already available in another csv file

There are 1919 rows and 12 columns in my file. There is a column named Genres that tells about the Genres of the games.
SAMPLE DATA:
Genres
Games, Strategy, Puzzle
Games, Entertainment, Action
...
...
Games, Strategy, Puzzle.
In such a way there are 1919 rows. I want to select rows that have the puzzle in them and store those entire rows in a separate variable without harming the original document. Just like copy and paste
I guess you have string type Genres column then answer would be
Using contains to get boolean index where word match and in the end reset index for new dataframe.
df2 = df[df.Genres.str.contains('Puzzle')].reset_index(drop=True)
df2 = df[df['Genres'] == 'Puzzle'].copy()
This will copy all rows with Genres == Puzzle into new dataframe df2

How to calculate based on multiple conditions using Python data frames?

I have excel data file with thousands of rows and columns.
I am using python and have started using pandas dataframes to analyze data.
What I want to do in column D is to calculate annual change for values in column C for each year for each ID.
I can use excel to do this – if the org ID is same are that in the prior row, calculate annual change (leaving the cells highlighted in blue because that’s the first period for that particular ID). I don’t know how to do this using python. Can anyone help?
Assuming the dataframe is already sorted
df.groupby(‘ID’).Cash.pct_change()
However, you can speed things up with the assumption things are sorted. Because it’s not necessary to group in order to calculate percentage change from one row to next
df.Cash.pct_change().mask(
df.ID != df.ID.shift()
)
These should produce the column values you are looking for. In order to add the column, you’ll need to assign to a column or create a new dataframe with the new column
df[‘AnnChange’] = df.groupby(‘ID’).Cash.pct_change()

Categories

Resources